 Review of XT6 Architecture
    AMD Opteron
    Cray Networks
    Lustre Basics
 Programming Environment
    PGI Compiler Basics
    The Cray Compiler Environment
    Cray Scientific Libraries
    Cray Message Passing Toolkit
 Cray Performance Analysis Tools
 ATP
 CCM
 Optimizations
    CPU
    Communication
    I/O
AMD CPU Architecture
   Cray Architecture
Lustre Filesystem Basics




AMD Opteron roadmap:

Year   Codename        Mfg. Process   CPU Core     L2/L3 Cache   HyperTransport   Memory
2003   AMD Opteron     130nm SOI      K8           1MB/0         3x 1.6GT/s       2x DDR1 300
2005   AMD Opteron     90nm SOI       K8           1MB/0         3x 1.6GT/s       2x DDR1 400
2007   "Barcelona"     65nm SOI       Greyhound    512kB/2MB     3x 2GT/s         2x DDR2 667
2008   "Shanghai"      45nm SOI       Greyhound+   512kB/6MB     3x 4.0GT/s       2x DDR2 800
2009   "Istanbul"      45nm SOI       Greyhound+   512kB/6MB     3x 4.8GT/s       2x DDR2 800
2010   "Magny-Cours"   45nm SOI       Greyhound+   512kB/12MB    4x 6.4GT/s       4x DDR3 1333
AMD Magny-Cours characteristics:
  12-core part:  1.7-2.2 GHz, 105.6 Gflops
  8-core part:   1.8-2.4 GHz, 76.8 Gflops
  Power (ACP):   80 Watts
  Stream:        27.5 GB/s
  Cache:         12x 64KB L1, 12x 512KB L2, 12MB L3
[Figure: Magny-Cours die layout — each core (Core 0 through Core 11) has a private L2 cache; each six-core die shares an L3 cache, a memory controller, and HyperTransport links]
 A cache line is 64B
 Unique L1 and L2 cache attached to each core
    L1 cache is 64 kbytes
    L2 cache is 512 kbytes
 L3 Cache is shared between 6 cores
    Cache is a “victim cache”
    All loads go to L1 immediately and get evicted down the caches
 Hardware prefetcher detects forward and backward strides through
  memory
 Each core can perform a 128b add and 128b multiply per clock cycle
    This requires SSE, packed instructions
    “Stride-one vectorization”
 6 cores share a “flat” memory
    Non-uniform-memory-access (NUMA) beyond a node
Processor        Frequency (GHz)   Peak (Gflops)   Bandwidth (GB/sec)   Balance (bytes/flop)
Istanbul (XT5)   2.6               62.4            12.8                 0.21
MC-8             2.0               64.0            42.6                 0.67
MC-8             2.3               73.6            42.6                 0.58
MC-8             2.4               76.8            42.6                 0.55
MC-12            1.9               91.2            42.6                 0.47
MC-12            2.1               100.8           42.6                 0.42
MC-12            2.2               105.6           42.6                 0.40
Gemini (XE-series)




 Microkernel on Compute PEs, full-featured Linux on Service PEs
 Service PEs specialize by function: Login, Network, System, and I/O PEs
 Service partition: specialized Linux nodes
 Software architecture eliminates OS "jitter"
 Software architecture enables reproducible run times
 Large machines boot in under 30 minutes, including the filesystem
XE6 System
[Figure: XE6 system diagram — external login server, boot RAID, 10 GbE and IB QDR connections]
Node characteristics:
  Number of Cores:                      16 or 24 (MC), 32 (IL)
  Peak Performance, MC-8 (2.4 GHz):     153 Gflops/sec
  Peak Performance, MC-12 (2.2 GHz):    211 Gflops/sec
  Memory Size:                          32 or 64 GB per node
  Memory Bandwidth:                     83.5 GB/sec
[Figure: node diagram — 6.4 GB/sec direct-connect HyperTransport, 83.5 GB/sec direct-connect memory, Cray SeaStar2+ interconnect]
[Figure: two-socket Magny-Cours node — four 6-core "Greyhound" dies, each with a 6MB L3 cache and two DDR3 channels, fully connected with HT3 links; HT1/HT3 link to the interconnect]
 2 Multi-Chip Modules, 4 Opteron dies
 8 channels of DDR3 bandwidth to 8 DIMMs
 24 (or 16) computational cores, 24 MB of L3 cache
 Dies are fully connected with HT3
 Snoop filter feature allows the 4-die SMP to scale well
Without the snoop filter, a STREAM test shows 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth.
With the snoop filter, a STREAM test shows 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth.

This feature will be key for two-socket Magny-Cours nodes, which have the same architecture.
 New compute blade with 8 AMD
  Magny Cours processors
 Plug-compatible with XT5 cabinets
  and backplanes
 Upgradeable to AMD’s
  “Interlagos” series
 XE6 systems ship with the current
  SIO blade




 Supports 2 nodes per ASIC
 168 GB/sec routing capacity
 Scales to over 100,000 network endpoints
 Link-level reliability and adaptive routing
 Advanced resiliency features
 Provides global address space
 Advanced NIC designed to efficiently support
    MPI
    One-sided MPI
    Shmem
    UPC, Coarray Fortran
[Figure: Gemini ASIC block diagram — two HyperTransport 3 interfaces, NIC 0 and NIC 1, Netlink block, LO processor, and a 48-port YARC router]
Cray Baker node characteristics:
  Number of Cores:     16 or 24
  Peak Performance:    140 or 210 Gflops/s
  Memory Size:         32 or 64 GB per node
  Memory Bandwidth:    85 GB/sec
[Figure: 10 12X Gemini channels; each Gemini acts like two nodes on the 3D torus; high-radix YARC router with adaptive routing, 168 GB/sec capacity]
[Figure: blade module with SeaStar vs. blade module with Gemini on the 3D (X, Y, Z) torus]
[Figure: Gemini NIC block diagram — HT3 cave, FMA, BTE, NPT, AMO, CQ, NAT, RMT, and RAT blocks feeding network requests and responses into the router tiles]
 FMA (Fast Memory Access)
    Mechanism for most MPI transfers
    Supports tens of millions of MPI requests per second
 BTE (Block Transfer Engine)
    Supports asynchronous block transfers between local and remote memory, in either direction
    Used for large MPI transfers that happen in the background
 Two Gemini ASICs are
    packaged on a pin-compatible
    mezzanine card
   Topology is a 3-D torus
   Each lane of the torus is
    composed of 4 Gemini router
    “tiles”
   Systems with SeaStar
    interconnects can be upgraded
    by swapping this card
   100% of the 48 router tiles on
    each Gemini chip are used



Name       Architecture   Processor                      Network       # Cores   Memory/Core
Jade       XT4            AMD Budapest (2.1 GHz)         SeaStar 2.1   8584      2GB DDR2-800
Einstein   XT5            AMD Shanghai (2.4 GHz)         SeaStar 2.1   12827     2GB DDR2-800 (some nodes have 4GB/core)
MRAP       XT5            AMD Barcelona (2.3 GHz)        SeaStar 2.1   10400     4GB DDR2-800
Garnet     XE6            Magny-Cours 8-core (2.4 GHz)   Gemini 1.0    20160     2GB DDR3-1333
Raptor     XE6            Magny-Cours 8-core (2.4 GHz)   Gemini 1.0    43712     2GB DDR3-1333
Chugach    XE6            Magny-Cours 8-core (2.3 GHz)   Gemini 1.0    11648     2GB DDR3-1333
[Figure: cabinet airflow — alternating low-velocity and high-velocity airflow regions]
 Cool air is released into the computer room
 The hot air stream passes through the evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation)
 R134a absorbs energy only in the presence of heated air
 Phase change is 10x more efficient than pure water cooling
[Figure: evaporator — liquid in, liquid/vapor mixture out]
[Figure: R134a piping, inlet evaporator, and exit evaporators]
Term           Meaning                   Purpose
MDS            Metadata Server           Manages all file metadata for the filesystem. 1 per FS.
OST            Object Storage Target     The basic "chunk" of data written to disk. Max 160 per file.
OSS            Object Storage Server     Communicates with disks, manages 1 or more OSTs. 1 or more per FS.
Stripe Size    Size of chunks            Controls the size of file chunks stored to OSTs. Can't be changed once the file is written.
Stripe Count   Number of OSTs per file   Controls the parallelism of the file. Can't be changed once the file is written.
 32 MB per OST (32 MB – 5 GB) and 32 MB transfer size
    Unable to take advantage of file system parallelism
    Access to multiple disks adds overhead which hurts performance
[Figure: "Single Writer Write Performance" — Lustre write bandwidth (MB/s) vs. stripe count (1 to 160) for 1 MB and 32 MB stripe sizes]
 Single OST, 256 MB file size
    Performance can be limited by the process (transfer size) or the file system (stripe size)
[Figure: "Single Writer Transfer vs. Stripe Size" — Lustre write bandwidth (MB/s) vs. stripe size (1 to 128 MB) for 32 MB, 8 MB, and 1 MB transfer sizes]
 Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and
  possibly stripe size (a Fortran MPI-IO sketch follows below)
    lfs setstripe -c -1 -s 4M <file or directory> (-1 = all available OSTs, max 160; 4MB stripe)
    lfs setstripe -c 1 -s 16M <file or directory> (1 OST, 16M stripe)
    export MPICH_MPIIO_HINTS='*: striping_factor=160'
 Files inherit striping information from the parent directory; this cannot be
  changed once the file is written
    Set the striping before copying in files
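For illustration, a minimal Fortran sketch of passing the same kind of striping settings through an MPI_Info object at file-open time. The file name and hint values are placeholders; striping_factor and striping_unit are the standard ROMIO hint names (striping_unit, the stripe size in bytes, is assumed here in addition to the striping_factor shown above).

! Sketch only: set Lustre striping through MPI-IO hints at open time.
program set_stripe_hints
  use mpi
  implicit none
  integer :: ierr, info, fh

  call MPI_Init(ierr)
  call MPI_Info_create(info, ierr)
  call MPI_Info_set(info, "striping_factor", "16", ierr)       ! stripe count (placeholder)
  call MPI_Info_set(info, "striping_unit", "4194304", ierr)    ! 4 MB stripe size (placeholder)
  call MPI_File_open(MPI_COMM_WORLD, "output.dat", &
                     ior(MPI_MODE_CREATE, MPI_MODE_WRONLY), info, fh, ierr)
  ! ... collective writes (e.g. MPI_File_write_all) would go here ...
  call MPI_File_close(fh, ierr)
  call MPI_Info_free(info, ierr)
  call MPI_Finalize(ierr)
end program set_stripe_hints

The hints only take effect on a newly created file, which is the same restriction as the lfs settings above.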




Available Compilers
   Cray Scientific Libraries
Cray Message Passing Toolkit




 Cray XT/XE supercomputers come with compiler wrappers to simplify
  building parallel applications (similar to mpicc/mpif90); a minimal example follows below
    Fortran Compiler: ftn
    C Compiler: cc
    C++ Compiler: CC
 Using these wrappers ensures that your code is built for the compute
  nodes and linked against important libraries
    Cray MPT (MPI, Shmem, etc.)
    Cray LibSci (BLAS, LAPACK, etc.)
    …
 Choose the underlying compiler via the PrgEnv-* modules; do not call
  the PGI, Cray, etc. compilers directly.
 Always load the appropriate xtpe-<arch> module for your machine
    Enables proper compiler target
    Links optimized math libraries
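As a quick illustration (file and program names are arbitrary), a trivial MPI program built with the ftn wrapper; no MPI include paths or libraries are specified because the wrapper adds them, and the PrgEnv-* module determines which compiler actually runs.

! build: ftn -o hello hello.f90
! run:   aprun -n 4 ./hello
program hello
  use mpi
  implicit none
  integer :: ierr, rank, nranks
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  print '(a,i0,a,i0)', 'Hello from rank ', rank, ' of ', nranks
  call MPI_Finalize(ierr)
end program hello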


…from Cray’s Perspective

 PGI – Very good Fortran and C, pretty good C++
     Good vectorization
     Good functional correctness with optimization enabled
     Good manual and automatic prefetch capabilities
     Very interested in the Linux HPC market, although that is not their only focus
     Excellent working relationship with Cray, good bug responsiveness
 Pathscale – Good Fortran, C, possibly good C++
     Outstanding scalar optimization for loops that do not vectorize
     Fortran front end uses an older version of the CCE Fortran front end
     OpenMP uses a non-pthreads approach
     Scalar benefits will not get as much mileage with longer vectors
 Intel – Good Fortran, excellent C and C++ (if you ignore vectorization)
     Automatic vectorization capabilities are modest, compared to PGI and CCE
     Use of inline assembly is encouraged
     Focus is more on best speed for scalar, non-scaling apps
     Tuned for Intel architectures, but actually works well for some applications on
       AMD
…from Cray’s Perspective

 GNU so-so Fortran, outstanding C and C++ (if you ignore vectorization)
    Obviously, the best for gcc compatibility
    Scalar optimizer was recently rewritten and is very good
    Vectorization capabilities focus mostly on inline assembly
    Note the last three releases have been incompatible with each other (4.3, 4.4,
      and 4.5) and required recompilation of Fortran modules
 CCE – Outstanding Fortran, very good C, and okay C++
    Very good vectorization
    Very good Fortran language support; only real choice for Coarrays
    C support is quite good, with UPC support
    Very good scalar optimization and automatic parallelization
    Clean implementation of OpenMP 3.0, with tasks
    Sole delivery focus is on Linux-based Cray hardware systems
    Best bug turnaround time (if it isn’t, let us know!)
    Cleanest integration with other Cray tools (performance tools, debuggers,
     upcoming productivity tools)
    No inline assembly support

 PGI
       -fast -Mipa=fast(,safe)
       If you can be flexible with precision, also try -Mfprelaxed
       Compiler feedback: -Minfo=all -Mneginfo
       man pgf90; man pgcc; man pgCC; or pgf90 -help
 Cray
    <none, turned on by default>
    Compiler feedback: -rm (Fortran) -hlist=m (C)
    If you know you don’t want OpenMP: -xomp or -Othread0
    man crayftn; man craycc ; man crayCC
 Pathscale
    -Ofast Note: this is a little looser with precision than other compilers
    Compiler feedback: -LNO:simd_verbose=ON
    man eko (“Every Known Optimization”)
 GNU
    -O2 / -O3
    Compiler feedback: good luck
    man gfortran; man gcc; man g++
 Intel
    -fast
    Compiler feedback:
    man ifort; man icc; man iCC

 Traditional (scalar) optimizations are controlled via -O# compiler flags
    Default: -O2
 More aggressive optimizations (including vectorization) are enabled with
  the -fast or -fastsse metaflags
    These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre
      -Mautoinline -Mvect=sse -Mscalarsse
      -Mcache_align -Mflushz -Mpre
 Interprocedural analysis allows the compiler to perform whole-program
  optimizations. This is enabled with -Mipa=fast
 See man pgf90, man pgcc, or man pgCC for more information about
  compiler options.




 Compiler feedback is enabled with -Minfo and -Mneginfo
     This can provide valuable information about what optimizations were
         or were not done and why.
   To debug an optimized code, the -gopt flag will insert debugging
    information without disabling optimizations
   It’s possible to disable optimizations included with -fast if you believe one
    is causing problems
      For example: -fast -Mnolre enables -fast and then disables loop-carried
         redundancy elimination
   To get more information about any compiler flag, add -help with the
    flag in question
      pgf90 -help -fast will give more information about the -fast
         flag
   OpenMP is enabled with the -mp flag



Some compiler options may affect both performance and accuracy. Lower
accuracy often means higher performance, but these flags also make it possible to enforce strict accuracy.

 -Kieee: All FP math strictly conforms to IEEE 754 (off by default)
 -Ktrap: Turns on processor trapping of FP exceptions
 -Mdaz: Treat all denormalized numbers as zero
 -Mflushz: Set SSE to flush-to-zero (on with -fast)
 -Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to
  speed up some floating point optimizations
    Some other compilers turn this on by default; PGI chooses to favor
     accuracy over speed by default.




 Cray has a long tradition of high performance compilers on Cray
  platforms (Traditional vector, T3E, X1, X2)
    Vectorization
    Parallelization
    Code transformation
    More…
 Investigated leveraging an open source compiler called LLVM


 First release December 2008




[Figure: CCE structure — Fortran source and C/C++ source enter the Fortran front end and the C & C++ front end (the C and C++ front end is supplied by Edison Design Group, with Cray-developed code for extensions and interface support); both feed a common interprocedural analysis and optimization/parallelization stage (Cray Inc. compiler technology), followed by the X86 and Cray X2 code generators that produce the object file; X86 code generation comes from open-source LLVM with additional Cray-developed optimizations and interface support]
 Standard conforming languages and programming models
    Fortran 2003
    UPC & CoArray Fortran
       Fully optimized and integrated into the compiler
       No preprocessor involved
       Target the network appropriately:
          GASNet with Portals
          DMAPP with Gemini & Aries

 Ability and motivation to provide high-quality support for custom
  Cray network hardware
 Cray technology focused on scientific applications
    Takes advantage of Cray’s extensive knowledge of automatic
     vectorization
    Takes advantage of Cray’s extensive knowledge of automatic
     shared memory parallelization
    Supplements, rather than replaces, the available compiler
     choices
 Make sure it is available
    module avail PrgEnv-cray
 To access the Cray compiler
    module load PrgEnv-cray
 To target the various chip types
    module load xtpe-[barcelona,shanghai,mc8]
 Once you have loaded the module, "cc" and "ftn" are the Cray
  compilers
    Recommend just using the default options
    Use -rm (Fortran) and -hlist=m (C) to find out what happened
 man crayftn
 Excellent Vectorization
    Vectorize more loops than other compilers
 OpenMP 3.0
   Task and Nesting
 PGAS: Functional UPC and CAF available today
 C++ Support
 Automatic Parallelization
    Modernized version of Cray X1 streaming capability
    Interacts with OMP directives
 Cache optimizations
    Automatic Blocking
    Automatic Management of what stays in cache
 Prefetching, Interchange, Fusion, and much more…


 Loop Based Optimizations
    Vectorization
    OpenMP
       Autothreading
   Interchange
   Pattern Matching
   Cache blocking/ non-temporal / prefetching
 Fortran 2003 Standard; working on 2008
 PGAS (UPC and Co-Array Fortran)
    Some performance optimizations available in 7.1
 Optimization Feedback: Loopmark
 Focus

 Cray compiler supports a full and growing set of directives
  and pragmas (a short example follows the list below)

!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable


                              man directives
                              man loop_info
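A small illustrative sketch (the loop and trip count are made up) of how these directives sit in Fortran source:

subroutine scale_add(n, a, b, c)
  integer, intent(in) :: n
  real, intent(in)    :: a(n), b(n)
  real, intent(out)   :: c(n)
  integer :: i
!dir$ ivdep                       ! assert no loop-carried dependence
!dir$ loop_info max_trips(512)    ! hint the expected trip count to the compiler
  do i = 1, n
     c(i) = a(i) + 2.0*b(i)
  end do
end subroutine scale_add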
 Compiler can generate a filename.lst file.
    Contains an annotated listing of your source code, with letters indicating important
     optimizations

%%% Loopmark Legend %%%
 Primary Loop Type        Modifiers
 -----------------        ---------
                          a - vector atomic memory operation
 A - Pattern matched      b - blocked
 C - Collapsed            f - fused
 D - Deleted              i - interchanged
 E - Cloned               m - streamed but not partitioned
 I - Inlined              p - conditional, partial and/or computed
 M - Multithreaded        r - unrolled
 P - Parallel/Tasked      s - shortloop
 V - Vectorized           t - array syntax temp used
 W - Unwound              w - unwound
• ftn -rm …      or cc -hlist=m …
29. b-------<   do i3=2,n3-1
30. b b-----<      do i2=2,n2-1
31. b b Vr--<        do i1=1,n1
32. b b Vr            u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr      >           + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr            u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr      >           + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr-->        enddo
37. b b Vr--<        do i1=2,n1-1
38. b b Vr            r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr      >              - a(0) * u(i1,i2,i3)
40. b b Vr      >              - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr      >              - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr-->        enddo
43. b b----->      enddo
44. b------->    enddo

ftn-6289 ftn: VECTOR File = resid.f, Line = 29
 A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines
   32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
 A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
 A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32
   and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
 A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
 A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
 A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
 A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
 A loop starting at line 37 was vectorized.
 -hbyteswapio
   Link-time option
   Applies to all unformatted Fortran I/O
 Assign command
   With the PrgEnv-cray module loaded do this:
setenv FILENV assign.txt
assign -N swap_endian g:su
assign -N swap_endian g:du


 Can use assign to be more precise

 OpenMP is ON by default
   Optimizations controlled by -Othread#
   To shut it off, use -Othread0, -xomp, or -hnoomp

 Autothreading is NOT on by default
   -hautothread to turn it on
   Modernized version of the Cray X1 streaming capability
   Interacts with OMP directives

If you do not want to use OpenMP but have OMP directives in the
code, make sure to do a run with OpenMP shut off at compile time
 Cray has historically played a role in scientific library
  development
    BLAS3 were largely designed for Crays
    Standard libraries were tuned for Cray vector processors
      (later COTS)
    Cray has always tuned standard libraries for the Cray
      interconnect
 In the 90s, Cray provided many non-standard libraries
    Sparse direct, sparse iterative
 These days the goal is to remain portable (standard APIs)
  whilst providing more performance
    Advanced features, tuning knobs, environment variables
FFT:     CRAFFT, FFTW, P-CRAFFT
Dense:   BLAS, LAPACK, ScaLAPACK, IRT, CASE
Sparse:  CASK, PETSc, Trilinos

    IRT – Iterative Refinement Toolkit
    CASK – Cray Adaptive Sparse Kernels
    CRAFFT – Cray Adaptive FFT
    CASE – Cray Adaptive Simple Eigensolver
 There are many libsci libraries on the systems
 One for each of
    Compiler (intel, cray, gnu, pathscale, pgi )
    Single thread, multiple thread
    Target (istanbul, mc12 )
 Best way to use libsci is to ignore all of this
    Load the xtpe-module (some sites set this by default)
    E.g. module load xtpe-shanghai / xtpe-istanbul / xtpe-mc8
 Cray’s drivers will link the library automatically
 PETSc, Trilinos, fftw, acml all have their own module
 Tip: make sure you have the correct library loaded, e.g.
   -Wl,-ydgemm_
 Perhaps you want to link another library such as ACML
 This can be done. If the library is provided by Cray, then load
  the module. The link will be performed with the libraries in the
  correct order.
 If the library is not provided by Cray and has no module, add it
  to the link line.
    Items you add to the explicit link will be in the correct place
 Note, to get explicit BLAS from ACML but scalapack from libsci
    Load acml module. Explicit calls to BLAS in code resolve
      from ACML
    BLAS calls from the scalapack code will be resolved from
      libsci (no way around this)
 Threading capabilities in previous LibSci versions were poor
    Used PTHREADS (more explicit affinity etc.)
    Required explicit linking to a _mp version of LibSci
    Was a source of concern for some applications that need
     hybrid performance and interoperability with OpenMP
 LibSci 10.4.2, February 2010
    OpenMP-aware LibSci
    Allows calling of BLAS inside or outside a parallel region
    Single library supported (there is still a single-thread lib)
 Usage – load the xtpe module for your system (e.g. mc12)

GOTO_NUM_THREADS is outmoded – use OMP_NUM_THREADS
 Allows seamless calling of the BLAS inside or outside a parallel
  region (a fuller example follows below)

e.g. OMP_NUM_THREADS = 12

call dgemm(…)      ! outside a parallel region: threaded dgemm is used with 12 threads
!$OMP PARALLEL DO
do
  call dgemm(…)    ! inside the parallel region: single-thread dgemm is used
end do

Some users are requesting a further layer of parallelism here (see
  later)
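A self-contained sketch of the same pattern (matrix size and loop bounds are illustrative; dgemm is the standard BLAS call resolved from LibSci by the ftn wrapper):

! build: ftn -o dgemm_test dgemm_test.f90   (LibSci linked automatically)
program libsci_omp_dgemm
  implicit none
  integer, parameter :: n = 256
  double precision :: a(n,n), b(n,n), c(n,n), cpriv(n,n)
  integer :: j

  call random_number(a)
  call random_number(b)

  ! Outside a parallel region: LibSci uses the threaded dgemm
  ! (OMP_NUM_THREADS threads cooperate on this one multiply)
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

  ! Inside a parallel region: each thread calls a single-threaded dgemm
  !$omp parallel do private(cpriv)
  do j = 1, 8
     call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, cpriv, n)
  end do
  !$omp end parallel do
end program libsci_omp_dgemm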


[Figure: "Libsci DGEMM efficiency" — GFLOPs vs. matrix dimension (square) for 1, 3, 6, 9, and 12 threads]
[Figure: "Libsci-10.5.2 performance on 2 x MC12 2.0 GHz (Cray XE6)" — GFLOPS vs. number of threads (1 to 24) for K = 64 through K = 800]
 All BLAS libraries are optimized for the rank-k update
 However, a huge % of dgemm usage is not from solvers but from explicit calls
 E.g. DCA++ matrices are of this form
 How can we very easily provide an optimization for these types of
  matrices?
[Figure: matrix shapes for the rank-k update vs. the DCA++-style explicit dgemm calls]
 Cray BLAS existed on every Cray machine between Cray-2 and Cray
  X2
 Cray XT line did not include Cray BLAS
    Cray’s expertise was in vector processors
    GotoBLAS was the best performing x86 BLAS
 LibGoto is now discontinued
 In Q3 2011 LibSci will be released with Cray BLAS




1.   Customers require more OpenMP features unobtainable
     with the current library
2.   Customers require more adaptive performance for
     unusual problems, e.g. DCA++
3.   Interlagos / Bulldozer is a dramatic shift in
     ISA/architecture/performance
4.   Our auto-tuning framework has advanced to the point
     that we can tackle this problem (good BLAS is easy,
     excellent BLAS is very hard)
5.   Need for bit-reproducible BLAS at high performance
"anything that can be represented in C, Fortran or ASM
  code can be generated automatically by one instance
  of an abstract operator in high-level code“

In other words, if we can create a purely general model
  of matrix-multiplication, and create every instance of
  it, then at least one of the generated schemes will
  perform well



                    2011 HPCMP User Group © Cray Inc.   June 20, 2011   79
 Start with a completely general formulation of the BLAS
 Use a DSL that expresses every important optimization
 Auto-generate every combination of orderings, buffering, and
    optimization
   For every combination of the above, sweep all possible sizes
   For a given input set ( M, N, K, datatype, alpha, beta ) map the
    best dgemm routine to the input
   The current library should be a specific instance of the above
   Worst-case performance can be no worse than current library
   The lowest level of blocking is a hand-written assembly kernel



[Figure: GFLOPS of the auto-generated "bframe" kernels vs. libsci across a sweep of small matrix dimensions (y-axis 7.05 to 7.5 GFLOPS)]
 New optimizations for Gemini network in the ScaLAPACK LU and Cholesky
     routines

1.    Change the default broadcast topology to match the Gemini network

2.    Give tools to allow the topology to be changed by the user

3.    Give guidance on how grid-shape can affect the performance




 Parallel version of LAPACK GETRF
 Panel factorization
   Only a single column block is involved
   The rest of the PEs are waiting
 Trailing matrix update
   Major part of the computation
   Column-wise broadcast (blocking)
   Row-wise broadcast (asynchronous)
 Data is packed before sending using PBLAS
 Broadcast uses the BLACS library
 These broadcasts are the major communication
  patterns
 MPI default
    Binomial tree + node-aware broadcast
    All PEs make an implicit barrier to ensure completion
    Not suitable for rank-k update

 Bidirectional-ring broadcast
    Root PE makes 2 MPI send calls, one in each direction
    The immediate neighbor finishes first
    ScaLAPACK's default
    Better than MPI
 Increasing Ring Broadcast (our new default)
    Root makes a single MPI call to the immediate neighbor
    Pipelining
    Better than bidirectional ring
    The immediate neighbor finishes first



 Multi-Ring Broadcast (2, 4, 8 etc)
    The immediate neighbor finishes first
    The root PE sends to multiple sub-rings
        Can be done with tree algorithm

    2 rings seem to be the best for the row-wise broadcast of LU




 Hypercube
    Behaves like MPI default
    Too many collisions in the message traffic
 Decreasing Ring
    The immediate neighbor finishes last
    No benefit in LU
 Modified Increasing Ring
    Best performance in HPL
    As good as increasing ring




[Figure: "XDLU performance: 3072 cores, size=65536" — Gflops for SRING vs. IRING broadcasts across NB / P / Q combinations]
[Figure: "XDLU performance: 6144 cores, size=65536" — Gflops for SRING vs. IRING broadcasts across NB / P / Q combinations]
 A row-major process grid puts adjacent PEs in the same row
    Adjacent PEs are most probably located on the same node
    In flat MPI, 16 or 24 PEs are on the same node
    In hybrid mode, several are on the same node
 Most MPI sends in the increasing ring happen within the same node
    MPI has a good shared-memory device
 Good pipelining
[Figure: adjacent ranks grouped onto Node 0, Node 1, Node 2]
 For PxGETRF:
    SCALAPACK_LU_CBCAST
    SCALAPACK_LU_RBCAST
 For PxPOTRF:
    SCALAPACK_LLT_CBCAST
    SCALAPACK_LLT_RBCAST
    SCALAPACK_UTU_CBCAST
    SCALAPACK_UTU_RBCAST
 The variables let users choose the broadcast algorithm:
    IRING   increasing ring (default value)
    DRING   decreasing ring
    SRING   split ring (old default value)
    MRING   multi-ring
    HYPR    hypercube
    MPI     mpi_bcast
    TREE    tree
    FULL    fully connected
 There is also a set function, allowing the user to change these on the fly
 Grid shape / size
    A square grid is most common
    Try to use Q = x * P grids, where x = 2, 4, 6, 8
    Square grids are often not the best
 Blocksize
    Unlike HPL, fine-tuning is not important
    64 is usually the best
 Ordering
    Try using column-major ordering, it can be better
 BCAST
    The new default will be a huge improvement if you can make your grid
      the right way. If you cannot, play with the environment variables.
      (A grid-setup sketch follows below.)
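A minimal sketch (using the standard BLACS calls shipped with LibSci) of building a rectangular Q = 2*P grid with row-major ordering before calling the ScaLAPACK solvers; the grid-shape arithmetic is illustrative and assumes the process count factors this way:

subroutine make_grid(context, nprow, npcol)
  implicit none
  integer, intent(out) :: context, nprow, npcol
  integer :: iam, nprocs

  call blacs_pinfo(iam, nprocs)
  ! Rectangular grid with Q = 2 * P (assumes nprocs = 2*P*P for some integer P)
  nprow = int(sqrt(real(nprocs) / 2.0))
  npcol = nprocs / nprow

  call blacs_get(-1, 0, context)
  ! 'R' = row-major ordering, so adjacent ranks share a grid row
  ! (and usually a node), which suits the increasing-ring broadcast
  call blacs_gridinit(context, 'R', nprow, npcol)
end subroutine make_grid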



 Full MPI2 support (except process spawning) based on ANL MPICH2
    Cray used the MPICH2 Nemesis layer for Gemini
    Cray-tuned collectives
    Cray-tuned ROMIO for MPI-IO


    Current Release: 5.3.0 (MPICH 1.3.1)
       Improved MPI_Allreduce and MPI_Alltoallv
       Initial support for checkpoint/restart for MPI or Cray SHMEM on XE
        systems
       Improved support for MPI thread safety.
       module load xt-mpich2
 Tuned SHMEM library
    module load xt-shmem


[Figure: "MPI_Alltoall with 10,000 Processes — Comparing Original vs Optimized Algorithms on Cray XE6 Systems": time (microseconds) vs. message size (256 to 32768 bytes) for the original and optimized algorithms]
[Figure: "8-Byte MPI_Allgather and MPI_Allgatherv Scaling — Comparing Original vs Optimized Algorithms on Cray XE6 Systems": time (microseconds) vs. number of processes (1024p to 32768p); the MPI_Allgather and MPI_Allgatherv algorithms are optimized for Cray XE6]
 Default is 8192 bytes
 Maximum size message that can go through the eager protocol
 May help for apps that send medium-size messages and do better
  when loosely coupled. Does the application have a large amount of time in
  MPI_Waitall? Setting this environment variable higher may help.
 Max value is 131072 bytes
 Remember that for this path it helps to pre-post receives if possible (a
  sketch follows below)
 Note that a 40-byte CH3 header is included when accounting for the
  message size.
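A generic sketch (ring neighbors and an 8 KB message chosen for illustration) of pre-posting the receive before the matching send, so an eagerly-sent message can land directly in the user buffer:

program prepost_recv
  use mpi
  implicit none
  integer, parameter :: nwords = 1024        ! 8 KB message, within the eager range
  double precision :: sendbuf(nwords), recvbuf(nwords)
  integer :: req(2), rank, nranks, left, right, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  left  = mod(rank - 1 + nranks, nranks)
  right = mod(rank + 1, nranks)
  sendbuf = dble(rank)

  ! Post the receive first so an eagerly-sent message can be delivered
  ! straight into recvbuf instead of an internal buffer
  call MPI_Irecv(recvbuf, nwords, MPI_DOUBLE_PRECISION, left, 0, &
                 MPI_COMM_WORLD, req(1), ierr)
  call MPI_Isend(sendbuf, nwords, MPI_DOUBLE_PRECISION, right, 0, &
                 MPI_COMM_WORLD, req(2), ierr)
  ! ... overlap useful computation here ...
  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
  call MPI_Finalize(ierr)
end program prepost_recv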




 Default is 64 32K buffers ( 2M total )
 Controls number of 32K DMA buffers available for each rank to use in the
  Eager protocol described earlier
 May help to modestly increase. But other resources constrain the usability
  of a large number of buffers.




 What do I mean by PGAS?
    Partitioned Global Address Space
       UPC
       CoArray Fortran ( Fortran 2008 )
       SHMEM (I will count as PGAS for convenience)
 SHMEM: Library based
    Not part of any language standard
    Compiler independent
        Compiler has no knowledge that it is compiling a PGAS code and
         does nothing different, i.e. no transformations or optimizations
         (a tiny SHMEM sketch of such library-based code follows)
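For illustration, a minimal Fortran SHMEM sketch; everything here is an ordinary library call, which is why the compiler needs no special PGAS knowledge. The ring exchange and variable names are made up for the example.

program shmem_ring
  implicit none
  include 'mpp/shmem.fh'
  integer, save :: dest              ! symmetric (SAVE) so it exists on every PE
  integer :: src
  integer :: my_pe, num_pes          ! SHMEM library functions

  call start_pes(0)
  src = my_pe()
  ! One-sided put: write our PE number into "dest" on the next PE
  call shmem_integer_put(dest, src, 1, mod(my_pe() + 1, num_pes()))
  call shmem_barrier_all()
  print *, 'PE', my_pe(), 'received', dest
end program shmem_ring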




 UPC
    Specification that extends the ISO/IEC 9899 standard for C
    Integrated into the language
    Heavily compiler dependent
         Compiler intimately involved in detecting and executing remote
         references
    Flexible, but filled with challenges like pointers, a lack of true
     multidimensional arrays, and many options for distributing data
 Fortran 2008
    Now incorporates coarrays
    Compiler dependent
    Philosophically different from UPC
        Replication of arrays on every image with “easy and obvious” ways
         to access those remote locations.
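For concreteness, a minimal UPC sketch (illustrative only; the array name and block size B are made up, and a UPC compiler such as CCE's is assumed):

#include <upc.h>
#include <stdio.h>

#define B 4
shared [B] long x[B * THREADS];   /* B consecutive elements have affinity to each thread */

int main(void)
{
    int i;

    /* Each thread initializes the elements it has affinity to. */
    upc_forall (i = 0; i < B * THREADS; i++; &x[i])
        x[i] = MYTHREAD;

    upc_barrier;

    if (MYTHREAD == 0) {
        /* A plain-looking read that is actually a remote load: element
           B*THREADS-1 lives on thread THREADS-1.  The compiler must detect
           this and generate the network transfer. */
        long v = x[B * THREADS - 1];
        printf("last element is owned by thread %ld\n", v);
    }
    return 0;
}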



 Translate the UPC source code into hardware executable operations that
  produce the proper behavior, as defined by the specification
    Storing to a remote location?
    Loading from a remote location?
    When does the transfer need to be complete?
    Are there any dependencies between this transfer and anything else?
    No ordering guarantees are provided by the network; the compiler is
     responsible for making sure everything gets to its destination in the
     correct order.




for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
    local_data[i] += global_2d[i][target];
}


for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
    temp = pgas_get(&global_2d[i][target]); // Initiate the get of the remote element
    pgas_fence();                           // Make sure the get is complete
    local_data[i] += temp;                  // Use the local copy to complete the operation
}
 The compiler must
       Recognize you are referencing a shared location
       Initiate the load of the remote data
       Make sure the transfer has completed
       Proceed with the calculation
       Repeat for all iterations of the loop

for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
    temp = pgas_get(&global_2d[i][target]); // Initiate the get of the remote element
    pgas_fence();                           // Make sure the get is complete
    local_data[i] += temp;                  // Use the local copy to complete the operation
}

 Simple translation results in
    Single word references
    Lots of fences
    Little to no latency hiding
    No use of special hardware
 Nothing here says “fast”

Want the compiler to generate code that will run as fast as possible given what the
user has written, or allow the user to get fast performance with simple modifications.
 Increase message size
    Do multi / many word transfers whenever possible, not single word.
 Minimize fences
    Delay fence “as much as possible”
    Eliminate the fence in some circumstances
 Use the appropriate hardware
    Use on-node hardware for on-node transfers
    Use transfer mechanism appropriate for this message size
    Overlap communication and computation
    Use hardware atomic functions where appropriate




Primary Loop Type            Modifiers
  A - Pattern matched          a - atomic memory operation
  C - Collapsed                b - blocked
  D - Deleted                  c - conditional and/or computed
  E - Cloned                   f - fused
  G - Accelerated              g - partitioned
  I - Inlined                  i - interchanged
  M - Multithreaded            m - partitioned
  V - Vectorized               n - non-blocking remote transfer
                               p - partial
                               r - unrolled
                               s - shortloop
                               w - unwound

15.          shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
…
 83. 1             before = upc_ticks_now();
 84. 1 r8------<     for ( i = 0, j = target; i < ELEMS_PER_THREAD ;
 85. 1 r8                i += 1, j += THREADS ) {
 86. 1 r8 n              local_data[i]= global_1d[j];
 87. 1 r8------>     }
 88. 1             after = upc_ticks_now();

 1D get BW= 0.027598 Gbytes/s




15.      shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
…
101. 1   before = upc_ticks_now();
102. 1   upc_memget(&local_data[0],&global_1d[target],8*ELEMS_PER_THREAD);
103. 1
104. 1   after = upc_ticks_now();




 1D get BW= 0.027598 Gbytes/s
 1D upc_memget BW= 4.972960 Gbytes/s

 upc_memget is roughly 180 times faster!

16.        shared long global_2d[MAX_ELEMS_PER_THREAD][THREADS];
 …
 121. 1 A-------<   for ( i = 0; i < ELEMS_PER_THREAD; i+=1) {
 122. 1 A               local_data[i] = global_2d[i][target];
 123. 1 A------->   }


 1D get BW= 0.027598 Gbytes/s
 1D upc_memget BW= 4.972960 Gbytes/s
 2D get time BW= 4.905653 Gbytes/s
 Pattern matching can give you the same performance as an explicit upc_memget

 PGAS data references made by the single statement immediately following the pgas
  defer_sync directive will not be synchronized until the next fence instruction.
    Only applies to the next UPC/CAF statement
    Does not apply to UPC library routines (e.g., upc_memget)
    Does not apply to SHMEM routines


 Normally the compiler synchronizes the references in a statement as late as
  possible without violating program semantics. The purpose of the defer_sync
  directive is to synchronize the references even later, beyond where the compiler
  can determine it is safe.


 Extremely powerful!
    Can easily overlap communication and computation with this statement
    Can apply to both “gets” and “puts”
    Can be used to implement a variety of “tricks”. Use your imagination!
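A hedged sketch of one way this could be used with CCE UPC; the pragma spelling (#pragma pgas defer_sync), the array name, and the block size are assumptions, not taken from these slides:

#include <upc.h>

#define B 512
shared [B] double x[B * THREADS];   /* B elements have affinity to each thread */
double work[B];                     /* private, thread-local data */

double overlapped_sum(void)
{
    double acc = 0.0, remote_val;
    int i;
    /* Index of the first element owned by the next thread (a remote get). */
    int remote_index = ((MYTHREAD + 1) % THREADS) * B;

    /* Defer synchronization of the PGAS reference in the next statement. */
    #pragma pgas defer_sync
    remote_val = x[remote_index];    /* get is initiated but not yet synchronized */

    for (i = 0; i < B; i++)          /* independent local work overlaps the transfer */
        acc += work[i];

    upc_fence;                       /* the deferred get is complete after the fence */
    return acc + remote_val;
}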




CrayPAT




 Future system basic characteristics:
    Many-core, hybrid multi-core computing


    Increase in on-node concurrency
        10s-100s of cores sharing memory
        With or without a companion accelerator
        Vector hardware at the low level


 Impact on applications:
    Restructure / evolve applications while using existing programming
      models to take advantage of increased concurrency

    Expand on use of mixed-mode programming models (MPI + OpenMP +
      accelerated kernels, etc.)

 Focus on automation (simplify tool usage, provide feedback based on
  analysis)

 Enhance support for multiple programming models within a program (MPI,
  PGAS, OpenMP, SHMEM)

 Scaling (larger jobs, more data, better tool response)


 New processors and interconnects


 Extend performance tools to include pre-runtime optimization information
  from the Cray compiler


 New predefined wrappers (ADIOS, ARMCI, PETSc, PGAS libraries)
 More UPC and Co-array Fortran support
 Support for non-record locking file systems
 Support for applications built with shared libraries
 Support for Chapel programs
 pat_report tables available in Cray Apprentice2




 Enhanced PGAS support is available in perftools 5.1.3 and later
     Profiles of a PGAS program can be created to show:
           Top time consuming functions/line numbers in the code
           Load imbalance information
           Performance statistics attributed to user source by default
           Can expose statistics by library as well
                 To see underlying operations, such as wait time on barriers
     Data collection is based on methods used for MPI library
           PGAS data is collected by default when using Automatic Profiling Analysis
            (pat_build –O apa)
           Predefined wrappers for runtime libraries (caf, upc, pgas) enable attribution of
            samples or time to user source
     UPC and SHMEM heap tracking coming in subsequent release
           -g heap will track shared heap in addition to local heap



Table 1:    Profile by Function

      Samp % | Samp | Imb. |   Imb. |Group
             |      | Samp | Samp % | Function
             |      |      |        | PE='HIDE'

     100.0% |   48 |   -- |     -- |Total
    |------------------------------------------
    | 95.8% |    46 |   -- |     -- |USER
    ||-----------------------------------------
    || 83.3% |    40 | 1.00 |   3.3% |all2all
    ||   6.2% |    3 | 0.50 | 22.2% |do_cksum
    ||   2.1% |    1 | 1.00 | 66.7% |do_all2all
    ||   2.1% |    1 | 0.50 | 66.7% |mpp_accum_long
    ||   2.1% |    1 | 0.50 | 66.7% |mpp_alloc
    ||=========================================
    |   4.2% |    2 |   -- |     -- |ETC
    ||-----------------------------------------
    ||   4.2% |    2 | 0.50 | 33.3% |bzero
    |==========================================




Table 2:        Profile by Group, Function, and Line


      Samp % | Samp | Imb. |         Imb. |Group
                |        | Samp | Samp % | Function
                |        |      |        |   Source
                |        |      |        |    Line
                |        |      |        |     PE='HIDE'


      100.0% |        48 |    -- |    -- |Total
    |--------------------------------------------
    |    95.8% |       46 |   -- |     -- |USER
    ||-------------------------------------------
    || 83.3% |    40 |   -- |     -- |all2all
    3|        |      |      |        | mpp_bench.c
    4|        |      |      |        | line.298
    ||   6.2% |    3 |   -- |     -- |do_cksum
    3|        |      |      |        | mpp_bench.c
    ||||-----------------------------------------
    4|||   2.1% |    1 | 0.25 | 33.3% |line.315
    4|||   4.2% |    2 | 0.25 | 16.7% |line.316
    ||||=========================================




Table 1:        Profile by Function and Callers, with Line Numbers
        Samp % | Samp |Group
                |          | Function
                |          |       Caller
                |          |        PE='HIDE’
        100.0% |        47 |Total
    |---------------------------
    |     93.6% |        44 |ETC
    ||--------------------------
    ||     85.1% |        40 |upc_memput
    3|              |          | all2all:mpp_bench.c:line.298
    4|              |          |    do_all2all:mpp_bench.c:line.348
    5|              |          |     main:test_all2all.c:line.70
    ||      4.3% |         2 |bzero
    3|              |          | (N/A):(N/A):line.0
    ||      2.1% |         1 |upc_all_alloc
    3|              |          | mpp_alloc:mpp_bench.c:line.143
    4|              |          |    main:test_all2all.c:line.25
    ||      2.1% |         1 |upc_all_reduceUL
    3|              |          | mpp_accum_long:mpp_bench.c:line.185
    4|              |          |    do_cksum:mpp_bench.c:line.317
    5|              |          |     do_all2all:mpp_bench.c:line.341
    6|              |          |      main:test_all2all.c:line.70
    ||==========================



Table 1:        Profile by Function and Callers, with Line Numbers


      Time % |          Time |       Calls |Group
                |            |             | Function
                |            |             |       Caller
                |            |             |       PE='HIDE'


      100.0% | 0.795844 | 73904.0 |Total
    |-----------------------------------------
    |    78.9% | 0.628058 | 41121.8 |PGAS
    ||----------------------------------------
    ||    76.1% | 0.605945 | 32768.0 |__pgas_put
    3|              |            |             | all2all:mpp_bench.c:line.298
    4|              |            |             |    do_all2all:mpp_bench.c:line.348
    5|              |            |             |     main:test_all2all.c:line.70
    ||      1.5% | 0.012113 |           10.0 |__pgas_barrier
    3|              |            |             | (N/A):(N/A):line.0
    …



…
    ||========================================
    |    15.7% | 0.125006 |     3.0 |USER
    ||----------------------------------------
    ||    12.2% | 0.097125 |     1.0 |do_all2all
    3|           |          |        | main:test_all2all.c:line.70
    ||      3.5% | 0.027668 |    1.0 |main
    3|           |          |        | (N/A):(N/A):line.0
    ||========================================
    |     5.4% | 0.042777 | 32777.2 |UPC
    ||----------------------------------------
    ||      5.3% | 0.042321 | 32768.0 |upc_memput
    3|           |          |        | all2all:mpp_bench.c:line.298
    4|           |          |        |   do_all2all:mpp_bench.c:line.348
    5|           |          |        |      main:test_all2all.c:line.70
    |=========================================




[Screenshot: Cray Apprentice2 showing the new text-table icon; right-click for table-generation options]
 Scalability
    New .ap2 data format and client / server model
        Reduced pat_report processing and report generation times
        Reduced app2 data load times
        Graphical presentation handled locally (not passed through ssh
           connection)
          Better tool responsiveness
          Minimizes data loaded into memory at any given time
          Reduced server footprint on Cray XT/XE service node
          Larger jobs supported

    Distributed Cray Apprentice2 (app2) client for Linux
        app2 client for Mac and Windows laptops coming later this year


 CPMD
   MPI, instrumented with pat_build –u, HWPC=1
   960 cores
                   Perftools 5.1.3            Perftools 5.2.0
  .xf -> .ap2        88.5 seconds             22.9 seconds
  ap2 -> report    1512.27 seconds            49.6 seconds

 VASP
   MPI, instrumented with pat_build –gmpi –u, HWPC=3
   768 cores

                   Perftools 5.1.3            Perftools 5.2.0
  .xf -> .ap2      45.2 seconds               15.9 seconds
  ap2 -> report    796.9 seconds              28.0 seconds

 From a Linux desktop:

   % module load perftools

   % app2
   % app2 kaibab:
   % app2 kaibab:/lus/scratch/heidi/swim+pat+10302-0t.ap2

   (a ':' signifies a remote host rather than a local .ap2 file)

 File->Open Remote… can also be used from within Cray Apprentice2
 Optional app2 client for Linux desktop available as of 5.2.0


 Can still run app2 from Cray service node


 Improves response times as X11 traffic is no longer passed through the ssh
  connection

 Replaces 32-bit Linux desktop version of Cray Apprentice2


 Uses libssh to establish connection


 app2 clients for Windows and Mac coming in subsequent release


[Diagram: traditional model — Cray Apprentice2 (app2) runs on a Cray XT login node; all data from my_program.ap2 plus X11 protocol traffic travels over the ssh connection to the X Window System application on the Linux desktop; performance data is collected on the compute nodes by my_program+apa.]

 Log into a Cray XT/XE login node
    % ssh –Y seal

 Launch Cray Apprentice2 on the Cray XT/XE login node
    % app2 /lus/scratch/mydir/my_program.ap2
    User interface displayed on desktop via ssh trusted X11 forwarding
    Entire my_program.ap2 file loaded into memory on the XT login node (can
     be gigabytes of data)
[Diagram: client/server model — the app2 client and X Window System application run on the Linux desktop; the app2 server on the Cray XT login node reads my_program.ap2 and sends only the user-requested data over the connection; performance data is collected on the compute nodes by my_program+apa.]

 Launch Cray Apprentice2 on the desktop, pointing it at the data
    % app2 seal:/lus/scratch/mydir/my_program.ap2
    User interface displayed on desktop via X Windows-based software
    Minimal subset of data from my_program.ap2 loaded into memory on the
     Cray XT/XE service node at any given time
    Only the data requested is sent from server to client
 Major change to the way HW counters are collected starting with CrayPat
  5.2.1 and CLE 4.0 (in conjunction with Interlagos support)

 Linux has officially incorporated support for accessing counters through the
  perf_events subsystem. Until now, Linux kernels had to be patched to add
  support for perfmon2, which provided access to the counters for PAPI
  and for CrayPat.

 Seamless to users, except:
    Overhead incurred when accessing counters has increased
    Creates additional application perturbation
    Working to bring this back in line with perfmon2 overhead




 When possible, CrayPat will identify dominant communication grids
  (communication patterns) in a program
    Example: nearest neighbor exchange in 2 or 3 dimensions
       Sweep3d uses a 2-D grid for communication


 Determine whether or not a custom MPI rank order will produce a
  significant performance benefit

 Custom rank orders are helpful for programs with significant point-to-point
  communication

 Doesn’t interfere with MPI collective communication optimizations



 Focuses on intra-node communication (places ranks that communicate
  frequently on the same node, or close by)
    Option to focus on other metrics such as memory bandwidth


 Determine rank order used during run that produced data
 Determine grid that defines the communication


 Produce a custom rank order if it’s beneficial based on grid size, grid order
  and cost metric

 Summarize findings in report
 Describe how to re-run with custom rank order



For Sweep3d with 768 MPI ranks:

This application uses point-to-point MPI communication between nearest
  neighbors in a 32 X 24 grid pattern. Time spent in this communication
  accounted for over 50% of the execution time. A significant fraction (but
  not more than 60%) of this time could potentially be saved by using the
  rank order in the file MPICH_RANK_ORDER.g which was generated along
  with this report.

To re-run with a custom rank order …




 Assist the user with application performance analysis and optimization
           Help user identify important and meaningful information from
            potentially massive data sets
           Help user identify problem areas instead of just reporting data
           Bring optimization knowledge to a wider set of users



     Focus on ease of use and intuitive user interfaces
        Automatic program instrumentation
        Automatic analysis


     Target scalability issues in all areas of tool development
        Data management
            Storage, movement, presentation



 Supports traditional post-mortem performance analysis
       Automatic identification of performance problems
                 Indication of causes of problems
                 Suggestions of modifications for performance improvement


 CrayPat
       pat_build: automatic instrumentation (no source code changes needed)
       run-time library for measurements (transparent to the user)
       pat_report for performance analysis reports
       pat_help: online help utility


 Cray Apprentice2
       Graphical performance analysis and visualization tool




 CrayPat
           Instrumentation of optimized code
           No source code modification required
           Data collection transparent to the user
           Text-based performance reports
           Derived metrics
           Performance analysis


     Cray Apprentice2
           Performance data visualization tool
           Call tree view
           Source code mappings




 When performance measurement is triggered
         External agent (asynchronous)
                 Sampling
                   Timer interrupt
                   Hardware counters overflow
         Internal agent (synchronous)
                 Code instrumentation
                   Event based
                   Automatic or manual instrumentation
   How performance data is recorded
         Profile ::= Summation of events over time
                 run time summarization (functions, call sites, loops, …)
         Trace file ::= Sequence of events over time




 Millions of lines of code
           Automatic profiling analysis
                 Identifies top time consuming routines
                 Automatically creates instrumentation template customized to your
              application
     Lots of processes/threads
        Load imbalance analysis
            Identifies computational code regions and synchronization calls that
              could benefit most from load balance optimization
            Estimates savings if corresponding section of code were balanced
     Long running applications
        Detection of outliers




 Important performance statistics:


       Top time consuming routines


       Load balance across computing resources


       Communication overhead


       Cache utilization


       FLOPS


       Vectorization (SSE instructions)


       Ratio of computation versus communication

 No source code or makefile modification required
           Automatic instrumentation at group (function) level
                 Groups: mpi, io, heap, math SW, …


     Performs link-time instrumentation
           Requires object files
           Instruments optimized code
           Generates stand-alone instrumented program
           Preserves original binary
           Supports sample-based and event-based instrumentation




       Analyze the performance data and direct the user to meaningful
            information

           Simplifies the procedure to instrument and collect performance data for
            novice users

          Based on a two phase mechanism
           1. Automatically detects the most time consuming functions in the
              application and feeds this information back to the tool for further
              (and focused) data collection

           2.   Provides performance information on the most significant parts of the
                application


 Performs data conversion


             Combines information from binary with raw performance
                data

       Performs analysis on data


     Generates text report of performance results


     Formats data for input into Cray Apprentice2

 Craypat / Cray Apprentice2 5.0 released September 10, 2009


           New internal data format
           FAQ
           Grid placement support
           Better caller information (ETC group in pat_report)
           Support larger numbers of processors
           Client/server version of Cray Apprentice2
           Panel help in Cray Apprentice2




       Access performance tools software

                % module load perftools

           Build application keeping .o files (CCE: -h keepfiles)

                % make clean
                % make

          Instrument application for automatic profiling analysis
              You should get an instrumented program a.out+pat

                % pat_build –O apa a.out

          Run application to get top time consuming routines
             You should get a performance file (“<sdatafile>.xf”) or
              multiple files in a directory <sdatadir>

                % aprun … a.out+pat               (or        qsub <pat script>)

      Generate report and .apa instrumentation file

       % pat_report –o my_sampling_report [<sdatafile>.xf |
          <sdatadir>]

      Inspect .apa file and sampling report

      Verify if additional instrumentation is needed




# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
#        pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
# These suggested trace options are based on data from:
#
#     /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2,
#     /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf

# ----------------------------------------------------------------------

#     HWPC group to collect by default.

 -Drtenv=PAT_RT_HWPC=1 # Summary with instructions metrics.

# ----------------------------------------------------------------------

#     Libraries to trace.

 -g mpi

# ----------------------------------------------------------------------

#     User-defined functions to trace, sorted by % of samples.
#     Limited to top 200. A function is commented out if it has < 1%
#     of samples, or if a cumulative threshold of 90% has been reached,
#     or if it has size < 200 bytes.

# Note: -u should NOT be specified as an additional option.

# 43.37% 99659 bytes
      -T mlwxyz_

# 16.09% 17615 bytes
      -T half_

#  6.82% 6846 bytes
      -T artv_

#  1.29% 5352 bytes
      -T currenh_

#  1.03% 25294 bytes
      -T bndbo_

# Functions below this point account for less than 10% of samples.

#  1.03% 31240 bytes
#     -T bndto_

...

# ----------------------------------------------------------------------

 -o mhd3d.x+apa                    # New instrumented program.

 /work/crayadm/ldr/mhd3d/mhd3d.x   # Original program.
   biolib    Cray Bioinformatics library routines
   blacs     Basic Linear Algebra communication subprograms
   blas      Basic Linear Algebra subprograms
   caf       Co-Array Fortran (Cray X2 systems only)
   fftw      Fast Fourier Transform library (64-bit only)
   hdf5      manages extremely large and complex data collections
   heap      dynamic heap
   io        includes stdio and sysio groups
   lapack    Linear Algebra Package
   lustre    Lustre File System
   math      ANSI math
   mpi       MPI
   netcdf    network common data form (manages array-oriented scientific data)
   omp       OpenMP API (not supported on Catamount)
   omp-rtl   OpenMP runtime library (not supported on Catamount)
   portals   Lightweight message passing API
   pthreads  POSIX threads (not supported on Catamount)
   scalapack Scalable LAPACK
   shmem     SHMEM
   stdio     all library functions that accept or return the FILE* construct
   sysio     I/O system calls
   system    system calls
   upc       Unified Parallel C (Cray X2 systems only)
  0  Summary with instruction metrics
  1  Summary with TLB metrics
  2  L1 and L2 metrics
  3  Bandwidth information
  4  Hypertransport information
  5  Floating point mix
  6  Cycles stalled, resources idle
  7  Cycles stalled, resources full
  8  Instructions and branches
  9  Instruction cache
 10  Cache hierarchy
 11  Floating point operations mix (2)
 12  Floating point operations mix (vectorization)
 13  Floating point operations mix (SP)
 14  Floating point operations mix (DP)
 15  L3 (socket-level)
 16  L3 (core-level reads)
 17  L3 (core-level misses)
 18  L3 (core-level fills caused by L2 evictions)
 19  Prefetches
 Regions, useful to break up long routines
    int PAT_region_begin (int id, const char *label)
    int PAT_region_end (int id)
 Disable/Enable Profiling, useful for excluding initialization
    int PAT_record (int state)
 Flush buffer, useful when program isn’t exiting cleanly
    int PAT_flush_buffer (void)
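A hedged C sketch of how this API might be used, following the signatures shown above; the pat_api.h header name and the PAT_STATE_ON / PAT_STATE_OFF constants are assumptions, and initialize() / compute() are hypothetical stand-ins for real application phases:

#include <pat_api.h>   /* CrayPat API header (available with perftools loaded) */

/* Hypothetical application phases, stand-ins for real code. */
static void initialize(void) { /* ... setup ... */ }
static void compute(void)    { /* ... main work ... */ }

int main(void)
{
    PAT_record(PAT_STATE_OFF);          /* exclude initialization from the experiment */
    initialize();
    PAT_record(PAT_STATE_ON);

    PAT_region_begin(1, "compute");     /* region id 1, labeled in reports */
    compute();
    PAT_region_end(1);

    PAT_flush_buffer();                 /* useful if the program may not exit cleanly */
    return 0;
}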




      Instrument application for further analysis (a.out+apa)

       % pat_build –O <apafile>.apa

      Run application

       % aprun … a.out+apa         (or    qsub <apa script>)

      Generate text report and visualization file (.ap2)

       % pat_report –o my_text_report.txt [<datafile>.xf |
          <datadir>]


      View report in text and/or with Cray Apprentice2

       % app2 <datafile>.ap2



 MUST run on Lustre ( /work/… , /lus/…, /scratch/…, etc.)


     Number of files used to store raw data


           1 file created for program with 1 – 256 processes

           √n files created for a program with n processes, when n > 256

           Ability to customize with PAT_RT_EXPFILE_MAX
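           For example, a run with 16,384 processes produces √16384 = 128 raw data files by default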




 Full trace files show transient events but are too large

 Current run-time summarization misses transient events

 Plan to add ability to record:

     Top N peak values (N small)

     Approximate std dev over time

     For time, memory traffic, etc.

     During tracing and sampling




 Call graph profile
 Communication statistics
 Time-line view
    Communication
    I/O
 Activity view
 Pair-wise communication statistics
 Text reports
 Source code mapping

 Cray Apprentice2 is targeted to help identify and correct:
    Load imbalance
    Excessive communication
    Network contention
    Excessive serialization
    I/O problems
Switch Overview display




[Screenshot callouts: Min, Avg, and Max values; −1/+1 standard-deviation marks]
[Cray Apprentice2 overview callouts]
   Width  → inclusive time
   Height → exclusive time
   Load balance overview:
      Height     → max time
      Middle bar → average time
      Lower bar  → min time
      Yellow represents imbalance time
   DUH Button: provides hints for performance tuning
   Filtered nodes or sub-trees are marked
   Function List pane; Zoom control
[Call tree view callouts]
   Right mouse click on a node: node menu (e.g., hide/unhide children)
   Right mouse click in the view: view menu (e.g., Filter)
   Sort options: % Time, Time, Imbalance %, Imbalance time
   Function List can be toggled off
[Screenshot callouts: Min, Avg, and Max values; −1/+1 standard-deviation marks]
 Cray Apprentice2 panel help


     pat_help – interactive help on the Cray Performance toolset


     FAQ available through pat_help




 intro_craypat(1)
           Introduces the craypat performance tool
     pat_build
           Instrument a program for performance analysis
     pat_help
           Interactive online help utility
     pat_report
           Generate performance report in both text and for use with GUI
     hwpc(3)
           describes predefined hardware performance counter groups
     papi_counters(5)
           Lists PAPI event counters
           Use papi_avail or papi_native_avail utilities to get list of events when
                running on a specific architecture
pat_report: Help for -O option:

Available option values are in left column, a prefix can be specified:

  ct                  -O calltree
  defaults            Tables that would appear by default.
  heap                -O heap_program,heap_hiwater,heap_leaks
  io                  -O read_stats,write_stats
  lb                  -O load_balance
  load_balance        -O lb_program,lb_group,lb_function
  mpi                 -O mpi_callers
  ---
  callers             Profile by Function and Callers
  callers+hwpc        Profile by Function and Callers
  callers+src         Profile by Function and Callers,                with Line Numbers
  callers+src+hwpc    Profile by Function and Callers,                with Line Numbers
  calltree            Function Calltree View
  calltree+hwpc       Function Calltree View
  calltree+src        Calltree View with Callsite Line                Numbers
  calltree+src+hwpc   Calltree View with Callsite Line                Numbers
  ...


 Interactive by default, or use trailing '.' to just print a topic:


     New FAQ craypat 5.0.0.


     Has counter and counter group information


          % pat_help counters amd_fam10h groups .




The top level CrayPat/X help topics are listed below.
       A good place to start is:
                overview
       If a topic has subtopics, they are displayed under the heading
       "Additional topics", as below. To view a subtopic, you need
       only enter as many initial letters as required to distinguish
       it from other items in the list. To see a table of contents
       including subtopics of those subtopics, etc., enter:
                toc
       To produce the full text corresponding to the table of contents,
       specify "all", but preferably in a non-interactive invocation:
                pat_help all . > all_pat_help
                pat_help report all . > all_report_help
   Additional topics:
       API                         execute
       balance                     experiment
       build                       first_example
       counters                    overview
       demos                       report
       environment                 run
pat_help (.=quit ,=back ^=up /=top ~=search)
=>

 ATP (Abnormal Termination Processing), or: what do you do when task a
  causes task b to crash?
    Load the atp module before compiling
    Set ATP_ENABLED before running
 Limitations
    ATP disables core dumping. When ATP is running, an application crash
     does not produce a core dump.
    When ATP is running, the application cannot be checkpointed.
    ATP does not support threaded application processes.
    ATP has been tested at 10,000 cores. Behavior at core counts greater
     than 10,000 is still being researched.




Application 926912 is crashing. ATP analysis proceeding...

       Stack walkback for Rank 3 starting:
        _start@start.S:113
        __libc_start_main@libc-start.c:220
        main@testMPIApp.c:83
        foo@testMPIApp.c:47
        raise@pt-raise.c:42
       Stack walkback for Rank 3 done
       Process died with signal 4: 'Illegal instruction'
       View application merged backtrace tree file
'atpMergedBT.dot' with 'statview'
       You may need to 'module load stat'.




 What CCM is NOT
   It is NOT a virtual machine or any OS within an OS
   It is NOT an emulator




 What is CCM Then?
    Provides the runtime environment on compute nodes expected by ISV
     applications
    Dynamically allocates and configures compute nodes at job start
       Nodes are not permanently dedicated to CCM
       Any compute node can be used
       Allocated like any other batch job (on demand)

    MPI and third-party MPI runs over TCP/IP using high-speed network
    Supports standard services: ssh, rsh, nscd, ldap
    Complete root file system on the compute nodes
       Built on top of the Dynamic Shared Libraries (DSL) environment

    Apps run under CCM: Abaqus, MATLAB, CASTEP, Discover, DMol3,
     Mesodyn, EnSight and more

   Under CCM, everything the application can “see” is like a standard Linux
                cluster: Linux OS, x86 processor, and MPI
Cray XT6/XE6 System

[Diagram: compute nodes and service nodes, with compute nodes marked as ESM Mode Running,
 CCM Mode Running, or ESM Mode Idle]

• Many applications running in Extreme Scalability Mode (ESM)
• Submit CCM application through batch scheduler, nodes reserved
      qsub –l ccm=1 Qname AppScript
• Previous jobs finish, nodes configured for CCM
• Executes the batch script and application
• Other nodes scheduled for ESM or CCM applications as available
• After CCM job completes, CCM nodes cleared
• CCM nodes available for ESM or CCM applications
 Support MPIs that are configured to work with the OFED stack
 CCM1 supports ISV Applications over TCP/IP only
 CCM2 supports ISV Applications over TCP/IP and Gemini on XE6


 ISV Application Acceleration (IAA) directly utilizes HSN through the
  Gemini user-space APIs.

 Goal of IAA/CCM2 is to deliver latency and bandwidth improvement
  over CCM1 over TCP/IP.

 CCM2 infrastructure is currently in system test.
 IAA design and implementation phase is complete
 CCM2 with IAA is currently in integration test phase
 A code binary compiled for SLES and an Opteron
    DSOs are OK
 A third-party MPI library that can use TCP/IP
    We have tried OpenMPI, HP-MPI, LAM-MPI
    Most of the bigger apps are packaged with their own library (usually
     HP-MPI)
    Add ccmrun to the run script
 The IP address of the license server for the applications
    Note that right now CCM cannot do an NSLOOKUP
    LMHOSTS must be specified by IP address
 With CLE 4.0: an MPI library that uses IBVERBS




 ccmrun: analogous to aprun; runs a third-party batch job
    In most cases, if you already have a run script for your third-party app,
     adding ccmrun before the application command is all that is needed
 ccmlogin: allows interactive access to the head node of an allocated
  compute pool; takes optional ssh options
 CCM uses ssh known_hosts to set up passwordless ssh among the set of
  allocated compute nodes. You can ssh to allocated nodes but no further.




[Diagram: XE6 system layout — external login servers, internal login (PBS) nodes, and
 compute nodes; boot RAID storage; IB QDR and 10 GbE connections to the external login
 server.]
 External login nodes: Dell 4-socket servers through which the user enters
  the system

 PBS nodes: internal single-socket, 6-core nodes that run the PBS MOMs
    aprun must be issued from a node on the system database

 Compute nodes: 2-socket, 8-core Opteron nodes that run a trimmed-down
  OS (still Linux)




news: diskuse_work diskuse_home system_info.txt
aminga@garnet01:~> uname -a
Linux garnet01 2.6.27.48-0.12-default #1 SMP 2010-09-20 11:03:26 -0400
x86_64 x86_64 x86_64 GNU/Linux

aminga@garnet01:~> qsub -I -lccm=1 -q debug -l walltime=01:00:00 -l ncpus=32 -A ERDCS97290STA

qsub: waiting for job 104868.sdb to start
qsub: job 104868.sdb ready
In CCM JOB: 104868.sdb JID sdb USER aminga GROUP erdcssta
Initializing CCM environment, Please Wait
CCM Start success, 2 of 2 responses
aminga@garnet13:~> uname -a
Linux garnet13 2.6.27.48-0.12.1_1.0301.5737-cray_gem_s #1 SMP Mon Mar 28
22:20:59 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux



aminga@garnet13:~> cat $PBS_NODEFILE
nid00972
nid00972
nid00972
<snip>
nid01309
nid01309
nid01309

aminga@garnet13:~> ccmlogin
Last login: Mon Jun 13 13:03:26 2011 from nid01028

aminga@nid00972:~> uname -a
Linux nid00972 2.6.27.48-0.12.1_1.0301.5737-cray_gem_c #1 SMP Mon Mar 28 22:26:26 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

aminga@nid00972:~> ssh nid01309
Try `uname --help' for more information.
aminga@nid01309:~> uname -a
Linux nid01309 2.6.27.48-0.12.1_1.0301.5737-cray_gem_c #1 SMP Mon Mar 28 22:26:26 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
aminga@nid01309:~>

aminga@nid00972:~> ssh nid01310
Redirecting to /etc/ssh/ssh_config
ssh: connect to host nid01310 port ...: Connection refused
#!/bin/csh
#PBS -l mppwidth=2
#PBS -l mppnppn=1
#PBS -q ccm_queue
#PBS -j oe

cd $PBS_O_WORKDIR

perl ConstructMachines.LINUX.pl
setenv DSD_MachineLIST $PBS_O_WORKDIR/machines.LINUX
setenv MPI_COMMAND "/usr/local/applic/accelrys/MSModeling5.5/hpmpi/opt/hpmpi/bin/mpirun -np "

ccmrun ./RunDiscover.sh -np 2 nvt_m




#PBS -l mppwidth=2
#PBS -l mppnppn=1
#PBS -j oe
#PBS -N gauss-test-ccm
#PBS -q ccm_queue

cd $PBS_O_WORKDIR
cp $PBS_NODEFILE node_file
./CreatDefaultRoute.pl
mkdir -p scratch
setenv DVS_CACHE off
setenv g09root /usr/local/applic/gaussian/
setenv GAUSS_EXEDIR ${g09root}/g09
setenv GAUSS_EXEDIR ${g09root}/g09/linda-exe:$GAUSS_EXEDIR
setenv GAUSS_SCRDIR `pwd`
setenv TMPDIR `pwd`
source ${g09root}/g09/bsd/g09.login

setenv GAUSS_LFLAGS "-vv -nodefile node_file -opt Tsnet.Node.lindarsharg:ssh"

setenv LINDA_PATH ${g09root}/g09/linda8.2/opteron-linux
set LINDA_LAUNCHVERBOSE=1

ccmrun ${g09root}/g09/g09 < gauss-test-ccm.com
setenv TEND `echo "print time();" | perl`
echo "Gaussian CCM walltime: `expr $TEND - $TBEGIN` seconds"
cd $PBS_O_WORKDIR

/bin/rm -rf bhost.def
cat $PBS_NODEFILE > bhost.def

/bin/rm -rf job.script
cat > job.script << EOD
#!/bin/csh
set echo
cd $PWD
setenv AEROSOFT_HOME /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft
setenv LAMHOME /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft
setenv PATH /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft/bin:$PATH

setenv TMPDIR /work/aminga
ln -s /usr/lib64/libpng.so libpng.so.3
setenv LD_LIBRARY_PATH `pwd`:$LD_LIBRARY_PATH

setenv LAMRSH "ssh -x"
lamboot bhost.def

time mpirun -np 2 -x LD_LIBRARY_PATH gasp --mpi -i duct.xml --run 2 --elmhost 140.31.9.44

EOD



chmod +x job.script
ccmrun job.script
#!/bin/sh
#PBS -q ccm_queue
#PBS -lmppwidth=48
#PBS -j oe
#PBS -N CFX

cd $PBS_O_WORKDIR

TOP_DIR=/usr/local/applic/ansys
export ANSYSLIC_DIR=$TOP_DIR/shared_files/licensing
export LD_LIBRARY_PATH=$TOP_DIR/v121/CFX/tools/hpmpi-2.3/Linux-amd64/lib/linux_amd64:$LD_LIBRARY_PATH
export PATH=$TOP_DIR/v121/CFX/bin:$PATH

export CFX5RSH=ssh
export MPIRUN_OPTIONS="-TCP -prot -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23"

/bin/rm -rf host.list
cat $PBS_NODEFILE > host.list

export proc_list=`sort host.list | uniq -c | awk '{ printf("%s*%s ", $2, $1) ; }'`
echo $proc_list

which cfx5solve
ccmrun cfx5solve -def S*400k.def -par-dist "$proc_list" -start-method "HP MPI Distributed Parallel"
rm -f host.list


#!/bin/bash
#PBS -lmppwidth=16
#PBS -q ccm_queue
#PBS -j oe
#PBS -N abaqus_e1

cd $PBS_O_WORKDIR

TMPDIR=.

ABAQUS=/usr/local/applic/abaqus
#cp ${ABAQUS}/input/e1.inp e1.inp
cat $PBS_NODEFILE
echo "Run Abaqus"
ccmrun ${ABAQUS}/6.10-1/exec/abq6101.exe input=e1.inp job=e1 cpus=16 interactive




#!/bin/csh
#PBS -q ccm_queue
#PBS -l mppwidth=32
#PBS -j oe
#PBS -N AFRL_Fluent
cd $PBS_O_WORKDIR

setenv FLUENT_HOME /usr/local/applic/fluent/12.1/fluent

setenv FLUENT_ARCH lnamd64
setenv PATH /usr/local/applic/fluent/12.1/v121/fluent/bin:$PATH
setenv FLUENT_INC /usr/local/applic/fluent/12.1/v121/fluent
###setenv LM_LICENSE_FILE 7241@10.128.0.72
setenv LM_LICENSE_FILE 27000@10.128.0.76
setenv ANSYSLMD_LICENSE_FILE /home/applic/ansys/shared_files/licensing/license.dat
echo ${LM_LICENSE_FILE}

setenv FLUENT_VERSION -r12.1.1

cd $PBS_O_WORKDIR

rm -rf host.list
cat $PBS_NODEFILE > host.list
module load ccm dot

setenv MALLOC_MMAP_MAX_ 0
setenv MALLOC_TRIM_THRESHOLD_ 536870912
setenv MPIRUN_OPTIONS " -TCP -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7"
setenv MPIRUN_OPTIONS "${MPIRUN_OPTIONS},8,9,10,11,12,13,14,15 "
setenv MPI_SOCKBUFSIZE 524288
setenv MPI_WORKDIR $PWD
setenv MPI_COMMD 1024,1024

ccmrun /usr/local/applic/fluent/v121/fluent/bin/fluent -r12.1.2 2ddp -mpi=hp -gu -driver null -t4 -i blast.inp > tstfluent-blast.jobout
 ALPS allows you to run only one aprun instance per node. Using CCM you can
  get around that.
 So suppose you want to run 16 single-core jobs and use only one node:
    qsub -I -lccm=1 -q debug -l walltime=01:00:00 -l ncpus=16 -A
      ERDCS97290STA
    #PBS -j oe
    cd $PBS_O_WORKDIR
    ./myapp &
    ./myapp &
    ./myapp &
    ./myapp &
    ./myapp &
    ./myapp &
    wait    # wait for the backgrounded instances to finish

Engineering for Multi-level Parallelism




 Flat, all-MPI parallelism is beginning to be too limited as the number of
  compute cores rapidly increases
 It is becoming necessary to design applications with multiple levels of
  parallelism:

 High-level MPI parallelism between nodes
    You’re probably already doing this
 Loose, on-node parallelism via threads at a high level
    Most codes today are using MPI, but threading is becoming more
      important
 Tight, on-node, vector parallelism at a low level
    SSE/AVX on CPUs
    GPU threaded parallelism


Programmers need to expose the same parallelism for all future architectures
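A hedged C sketch of the three levels together (illustrative only; the array sizes and names are made up):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 4096

static double a[N], b[N], c[N];

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);                  /* level 1: MPI between nodes      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel for                 /* level 2: OpenMP threads on node */
    for (int j = 0; j < N; j++) {
        a[j] = rank + j;
        b[j] = rank - j;
        c[j] = a[j] + 2.0 * b[j];            /* level 3: stride-1 arithmetic the
                                                compiler can vectorize (SSE/AVX) */
    }

    printf("rank %d: c[0] = %f\n", rank, c[0]);
    MPI_Finalize();
    return 0;
}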

 A benchmark problem was defined to closely resemble the target simulation
    52-species n-heptane chemistry and 48³ grid points per node
    48³ grid points × 18,500 nodes ≈ 2 billion grid points
    Target problem would take two months on today’s Jaguar
 Code was benchmarked and profiled on dual-hexcore XT5
 Several kernels identified and extracted into stand-alone driver programs
    Mini-Apps!

[Pie chart: share of runtime spent in Chemistry vs. Core S3D]
Goals:
   Convert S3D to a hybrid multi-core application suited for a multi-core node with
    or without an accelerator.
       Hoisted several loops up the call tree
       Introduced high-level OpenMP
   Be able to perform the computation entirely on the accelerator if available.
    - Arrays and data able to reside entirely on the accelerator.
    - Data sent from accelerator to host CPU for halo communication, I/O and
       monitoring only.
  Strategy:
   To program using both hand-written and generated code.
    - Hand-written and tuned CUDA*.
    - Automated Fortran and CUDA generation for chemistry kernels
    - Automated code generation through compiler directives
   S3D kernels are now a part of Cray’s compiler development test cases
* Note: CUDA refers to CUDA-Fortran, unless mentioned otherwise
RHS – called 6 times for each time step (Runge-Kutta iterations):

    Calculate primary variables – point-wise; mesh loops within 5 different
     routines

    Perform derivative computation – high-order differencing

    Calculate diffusion – 3 different routines with some derivative
     computation

    Perform derivative computation for forming rhs – lots of communication

    Perform point-wise chemistry computation

Notes:
    All major loops are at a low level of the call tree
    Green – major computation, point-wise
    Yellow – major computation, halos 5 zones thick
RHS – called 6 times for each time step (Runge-Kutta iterations), after restructuring:
 OMP loop over grid: calculate primary variables – point-wise mesh loops within 3 different routines
 Perform derivative computation – high-order differencing (overlapped)
 OMP loop over grid: calculate primary variables – point-wise mesh loops within 2 different routines;
   calculate diffusion – 3 different routines with some derivative computation;
   perform derivative computation (overlapped)
 OMP loop over grid: perform point-wise chemistry computation (1)
 Perform derivative computation for forming rhs – lots of communication (overlapped)
 OMP loop over grid: perform point-wise chemistry computation (2)
                                          2011 HPCMP User Group © Cray Inc.          June 20, 2011
2011 HPCMP User Group © Cray Inc.   June 20, 2011   204
 Creates a good-granularity OpenMP loop
 Improves cache reuse
 Reduces memory usage significantly
 Creates a good potential kernel for an accelerator
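A minimal sketch (not S3D itself; routine and array names are invented) of what hoisting the grid loop and adding high-level OpenMP looks like: the loop over grid points moves up the call tree and becomes a single parallel loop, and everything inside it is point-wise.

! Hedged sketch of a hoisted, high-level OpenMP grid loop; names are placeholders.
program hoisted_omp
  implicit none
  integer, parameter :: npoints = 100000, nvar = 10
  real(8) :: q(nvar, npoints), rhs(nvar, npoints)
  integer :: i

  q = 1.0d0

!$omp parallel do private(i)
  do i = 1, npoints
     call point_kernel(q(:, i), rhs(:, i))   ! all work inside the loop is point-wise
  end do
!$omp end parallel do

contains
  subroutine point_kernel(qp, rp)
    real(8), intent(in)  :: qp(:)
    real(8), intent(out) :: rp(:)
    rp = 2.0d0 * qp                          ! placeholder for primary variables + chemistry
  end subroutine point_kernel
end program hoisted_omp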




                               2011 HPCMP User Group © Cray Inc.               205
                                                               June 20, 2011
CPU Optimizations
Optimizing Communication
    I/O Best Practices




   2011 HPCMP User Group © Cray Inc.   June 20, 2011   206
2011 HPCMP User Group © Cray Inc.   June 20, 2011   207
55. 1                    ii = 0
56. 1 2-----------<      do b = abmin, abmax
57. 1 2 3---------<        do j = ijmin, ijmax
58. 1 2 3                    ii = ii+1
59. 1 2 3                    jj = 0
60. 1 2 3 4-------<          do a = abmin, abmax
61. 1 2 3 4 r8----<            do i = ijmin, ijmax
62. 1 2 3 4 r8                   jj = jj+1
63. 1 2 3 4 r8                   f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
64. 1 2 3 4 r8                   f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
65. 1 2 3 4 r8                   f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
66. 1 2 3 4 r8                   f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
67. 1 2 3 4 r8---->            end do
68. 1 2 3 4------->          end do
69. 1 2 3--------->        end do
70. 1 2----------->      end do

Poor loop order results in poor striding:
 The inner-most loop strides on a slow dimension of each array.
 The best the compiler can do is unroll.
 Little to no cache reuse.
                               2011 HPCMP User Group © Cray Inc.   June 20, 2011                         208
USER / #1.Original Loops
-----------------------------------------------------------------
 Time%                                                     55.0%
 Time                                                 13.938244 secs
 Imb.Time                                              0.075369 secs
 Imb.Time%                                                   0.6%
 Calls                              0.1 /sec                 1.0 calls
 DATA_CACHE_REFILLS:
   L2_MODIFIED:L2_OWNED:
   L2_EXCLUSIVE:L2_SHARED       11.858M/sec           165279602 fills
 DATA_CACHE_REFILLS_FROM_SYSTEM:
   ALL                          11.931M/sec           166291054 fills
 PAPI_L1_DCM                    23.499M/sec           327533338 misses
 PAPI_L1_DCA                    34.635M/sec           482751044 refs
 User time (approx)             13.938 secs        36239439807 cycles
  100.0%Time
 Average Time per Call                                13.938244 sec
 CrayPat Overhead : Time           0.0%
 D1 cache hit,miss ratios          32.2% hits              67.8% misses
 D2 cache hit,miss ratio           49.8% hits              50.2% misses
 D1+D2 cache hit,miss ratio        66.0% hits              34.0% misses

Poor loop order results in poor cache reuse:
 For every L1 cache hit there are roughly two misses.
 Overall, only 2/3 of all references were satisfied from level 1 or level 2 cache.




                               2011 HPCMP User Group © Cray Inc.    June 20, 2011                           209
2011 HPCMP User Group © Cray Inc.   June 20, 2011   210
2011 HPCMP User Group © Cray Inc.   June 20, 2011   211
75. 1 2-----------<      do i = ijmin, ijmax
76. 1 2                    jj = 0
77. 1 2 3---------<        do a = abmin, abmax
78. 1 2 3 4-------<          do j = ijmin, ijmax
79. 1 2 3 4                    jj = jj+1
80. 1 2 3 4                    ii = 0
81. 1 2 3 4 Vcr2--<            do b = abmin, abmax
82. 1 2 3 4 Vcr2                 ii = ii+1
83. 1 2 3 4 Vcr2                 f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
84. 1 2 3 4 Vcr2                 f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
85. 1 2 3 4 Vcr2                 f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
86. 1 2 3 4 Vcr2                 f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
87. 1 2 3 4 Vcr2-->            end do
88. 1 2 3 4------->          end do
89. 1 2 3--------->        end do
90. 1 2----------->      end do

Reordered loop nest:
 The inner-most loop is now stride-1 on both arrays.
 Memory accesses run along cache lines, allowing reuse.
 The compiler is able to vectorize and make better use of SSE instructions.


                              2011 HPCMP User Group © Cray Inc.   June 20, 2011                           212
USER / #2.Reordered Loops
-----------------------------------------------------------------
 Time%                                                     31.4%
 Time                                                  7.955379 secs
 Imb.Time                                              0.260492 secs
 Imb.Time%                                                   3.8%
 Calls                               0.1 /sec                 1.0 calls
 DATA_CACHE_REFILLS:
   L2_MODIFIED:L2_OWNED:
   L2_EXCLUSIVE:L2_SHARED          0.419M/sec             3331289 fills
 DATA_CACHE_REFILLS_FROM_SYSTEM:
   ALL                          15.285M/sec           121598284 fills
 PAPI_L1_DCM                    13.330M/sec           106046801 misses
 PAPI_L1_DCA                    66.226M/sec           526855581 refs
 User time (approx)             7.955 secs         20684020425 cycles
  100.0%Time
 Average Time per Call                                 7.955379 sec
 CrayPat Overhead : Time            0.0%
 D1 cache hit,miss ratios          79.9% hits              20.1% misses
 D2 cache hit,miss ratio            2.7% hits              97.3% misses
 D1+D2 cache hit,miss ratio        80.4% hits              19.6% misses

Improved striding greatly improved cache reuse:
 Runtime was cut nearly in half.
 Still, some 20% of all references are cache misses.




                               2011 HPCMP User Group © Cray Inc.    June 20, 2011                            213
First loop, partially vectorized and unrolled by 4:

95.    1                    ii = 0
96.    1 2-----------<      do j = ijmin, ijmax
97.    1 2 i---------<        do b = abmin, abmax
98.    1 2 i                    ii = ii+1
99.    1 2 i                    jj = 0
100.   1 2 i i-------<          do i = ijmin, ijmax
101.   1 2 i i Vpr4--<            do a = abmin, abmax
102.   1 2 i i Vpr4                 jj = jj+1
103.   1 2 i i Vpr4                 f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
104.   1 2 i i Vpr4                 f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
105.   1 2 i i Vpr4-->            end do
106.   1 2 i i------->          end do
107.   1 2 i--------->        end do
108.   1 2----------->      end do

Second loop, vectorized and unrolled by 4:

109.   1                    jj = 0
110.   1 2-----------<      do i = ijmin, ijmax
111.   1 2 3---------<        do a = abmin, abmax
112.   1 2 3                    jj = jj+1
113.   1 2 3                    ii = 0
114.   1 2 3 4-------<          do j = ijmin, ijmax
115.   1 2 3 4 Vr4---<            do b = abmin, abmax
116.   1 2 3 4 Vr4                  ii = ii+1
117.   1 2 3 4 Vr4                  f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
118.   1 2 3 4 Vr4                  f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
119.   1 2 3 4 Vr4--->            end do
120.   1 2 3 4------->          end do
121.   1 2 3--------->        end do
122.   1 2----------->      end do




                                        2011 HPCMP User Group © Cray Inc.   June 20, 2011                          214
USER / #3.Fissioned Loops
-----------------------------------------------------------------
 Time%                                                       9.8%
 Time                                                  2.481636 secs
 Imb.Time                                              0.045475 secs
 Imb.Time%                                                   2.1%
 Calls                               0.4 /sec                 1.0 calls
 DATA_CACHE_REFILLS:
   L2_MODIFIED:L2_OWNED:
   L2_EXCLUSIVE:L2_SHARED          1.175M/sec             2916610 fills
 DATA_CACHE_REFILLS_FROM_SYSTEM:
   ALL                          34.109M/sec            84646518 fills
 PAPI_L1_DCM                    26.424M/sec            65575972 misses
 PAPI_L1_DCA                  156.705M/sec            388885686 refs
 User time (approx)             2.482 secs          6452279320 cycles
  100.0%Time
 Average Time per Call                                 2.481636 sec
 CrayPat Overhead : Time            0.0%
 D1 cache hit,miss ratios          83.1% hits              16.9% misses
 D2 cache hit,miss ratio            3.3% hits              96.7% misses
 D1+D2 cache hit,miss ratio        83.7% hits              16.3% misses

Fissioning further improved cache reuse and resulted in better vectorization:
 Runtime was further reduced.
 The cache hit/miss ratio improved slightly.
 The loopmark file points to better vectorization of the fissioned loops.




                               2011 HPCMP User Group © Cray Inc.    June 20, 2011                          215
2011 HPCMP User Group © Cray Inc.   June 20, 2011   216
Triple-nested loop at a high level; IFs inside the inner loop can significantly reduce the chances of vectorization.

(    52) C         THE ORIGINAL
(    53)
(    54)          DO 47020   J = 1, JMAX
(    55)           DO 47020 K = 1, KMAX
(    56)            DO 47020 I = 1, IMAX
(    57)           JP              =   J   +   1
(    58)           JR              =   J   -   1
(    59)           KP              =   K   +   1
(    60)           KR              =   K   -   1
(    61)           IP              =   I   +   1
(    62)           IR              =   I   -   1
(    63)                IF (J .EQ. 1)     GO TO 50
(    64)                 IF( J .EQ. JMAX) GO TO 51
(    65)             XJ = ( A(I,JP,K) -            A(I,JR,K) ) * DA2
(    66)             YJ = ( B(I,JP,K) -            B(I,JR,K) ) * DA2
(    67)             ZJ = ( C(I,JP,K) -            C(I,JR,K) ) * DA2
(    68)             GO TO 70
(    69)      50   J1 = J + 1
(    70)           J2 = J + 2
(    71)           XJ = (-3. * A(I,J,K)            + 4. * A(I,J1,K) - A(I,J2,K) ) * DA2
(    72)           YJ = (-3. * B(I,J,K)            + 4. * B(I,J1,K) - B(I,J2,K) ) * DA2
(    73)           ZJ = (-3. * C(I,J,K)            + 4. * C(I,J1,K) - C(I,J2,K) ) * DA2
(    74)           GO TO 70
(    75)      51   J1 = J - 1
(    76)           J2 = J - 2
(    77)           XJ = ( 3. * A(I,J,K)            - 4. * A(I,J1,K) + A(I,J2,K) ) * DA2
(    78)           YJ = ( 3. * B(I,J,K)            - 4. * B(I,J1,K) + B(I,J2,K) ) * DA2
(    79)           ZJ = ( 3. * C(I,J,K)            - 4. * C(I,J1,K) + C(I,J2,K) ) * DA2
(    80)      70   CONTINUE
(    81)                IF (K .EQ. 1)     GO TO 52
(    82)                 IF (K .EQ. KMAX) GO TO 53
(    83)                XK   = ( A(I,J,KP) - A(I,J,KR) ) * DB2
(    84)                YK   = ( B(I,J,KP) - B(I,J,KR) ) * DB2
(    85)                ZK   = ( C(I,J,KP) - C(I,J,KR) ) * DB2
(    86)                GO   TO 71
    continues…                                           2011 HPCMP User Group © Cray Inc.   June 20, 2011               217
PGI
55, Invariant if transformation
     Loop not vectorized: loop count too small
  56, Invariant if transformation




                      2011 HPCMP User Group © Cray Inc.   June 20, 2011   218
Stride-1 I loops are brought inside the IF statements:

(   141) C       THE RESTRUCTURED
(   142)
(   143)          DO 47029 J = 1, JMAX
(   144)           DO 47029 K = 1, KMAX
(   145)
(   146)               IF(J.EQ.1)THEN
(   147)
(   148)          J1         = 2
(   149)          J2         = 3
(   150)               DO 47021 I = 1, IMAX
(   151)           VAJ(I) = (-3. * A(I,J,K) + 4. * A(I,J1,K) - A(I,J2,K) ) * DA2
(   152)           VBJ(I) = (-3. * B(I,J,K) + 4. * B(I,J1,K) - B(I,J2,K) ) * DA2
(   153)           VCJ(I) = (-3. * C(I,J,K) + 4. * C(I,J1,K) - C(I,J2,K) ) * DA2
(   154) 47021    CONTINUE
(   155)
(   156)               ELSE IF(J.NE.JMAX) THEN
(   157)
(   158)          JP         = J+1
(   159)          JR         = J-1
(   160)               DO 47022 I = 1, IMAX
(   161)           VAJ(I) = ( A(I,JP,K) - A(I,JR,K) ) * DA2
(   162)           VBJ(I) = ( B(I,JP,K) - B(I,JR,K) ) * DA2
(   163)           VCJ(I) = ( C(I,JP,K) - C(I,JR,K) ) * DA2
(   164) 47022    CONTINUE
(   165)
(   166)          ELSE
(   167)
(   168)          J1         = JMAX-1
(   169)          J2         = JMAX-2
(   170)          DO 47023 I = 1, IMAX
(   171)           VAJ(I) = ( 3. * A(I,J,K) - 4. * A(I,J1,K) + A(I,J2,K) ) * DA2
(   172)           VBJ(I) = ( 3. * B(I,J,K) - 4. * B(I,J1,K) + B(I,J2,K) ) * DA2
(   173)           VCJ(I) = ( 3. * C(I,J,K) - 4. * C(I,J1,K) + C(I,J2,K) ) * DA2
(   174) 47023    CONTINUE
(   175)
(   176)          ENDIF
Continues…                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011       219
PGI
 144, Invariant if transformation
    Loop not vectorized: loop count too small
 150, Generated 3 alternate loops for the inner loop
    Generated vector sse code for inner loop
    Generated 8 prefetch instructions for this loop
    Generated vector sse code for inner loop
    Generated 8 prefetch instructions for this loop
    Generated vector sse code for inner loop
    Generated 8 prefetch instructions for this loop
    Generated vector sse code for inner loop
    Generated 8 prefetch instructions for this loop
 160, Generated 4 alternate loops for the inner loop
    Generated vector sse code for inner loop
    Generated 6 prefetch instructions for this loop
    Generated vector sse code for inner loop
 ooo

                          2011 HPCMP User Group © Cray Inc.   June 20, 2011   220
[Figure: MFLOPS (0–2500) versus vector length (0–500) for CCE-Original Fortran, CCE-Restructured Fortran, PGI-Original Fortran, and PGI-Restructured Fortran]
                             2011 HPCMP User Group © Cray Inc.    June 20, 2011                221
 Maximum vector length doubled to 256 bits
 Much cleaner instruction set
   The result register is distinct from the source registers
   The old SSE instruction set always destroyed a source register

 Floating-point multiply-accumulate
   A(1:4) = B(1:4)*C(1:4) + D(1:4) ! Now one instruction

 Next gen of both AMD and Intel will have AVX


 Vectors are becoming more important, not less

                         2011 HPCMP User Group © Cray Inc.   June 20, 2011   222
2011 HPCMP User Group © Cray Inc.   June 20, 2011   223
 Cache blocking is a combination of strip mining and loop interchange, designed
  to increase data reuse.
     Takes advantage of temporal reuse: re-reference array elements already
       referenced
     Good blocking will take advantage of spatial reuse: work with the cache
       lines!
 Many ways to block any given loop nest
     Which loops get blocked?
     What block size(s) to use?
 Analysis can reveal which ways are beneficial
 But trial-and-error is probably faster




                              2011 HPCMP User Group © Cray Inc.   June 20, 2011   224
2D Laplacian example on a 16 × 8 grid of u (i = 1..16, j = 1..8):

      do j = 1, 8
         do i = 1, 16
            a(i,j) = u(i-1,j) + u(i+1,j) &
                     - 4*u(i,j)          &
                     + u(i,j-1) + u(i,j+1)
         end do
      end do

Cache structure for this example:
       Each line holds 4 array elements
       Cache can hold 12 lines of u data

No cache reuse between outer loop iterations: 120 cache misses in total.


                         2011 HPCMP User Group © Cray Inc.   June 20, 2011   225
Unblocked loop: 120 cache misses.
Block the inner loop (i blocks of 4: i = 1–4, 5–8, 9–12, 13–16):

      do IBLOCK = 1, 16, 4
         do j = 1, 8
            do i = IBLOCK, IBLOCK + 3
               a(i,j) = u(i-1,j) + u(i+1,j) &
                        - 4*u(i,j)          &
                        + u(i,j-1) + u(i,j+1)
            end do
         end do
      end do

Now we have reuse of the “j+1” data: 80 cache misses in total.



                        2011 HPCMP User Group © Cray Inc.   June 20, 2011   226
One-dimensional blocking reduced misses from 120 to 80.
Iterate over 4 × 4 blocks:

      do JBLOCK = 1, 8, 4
         do IBLOCK = 1, 16, 4
            do j = JBLOCK, JBLOCK + 3
               do i = IBLOCK, IBLOCK + 3
                  a(i,j) = u(i-1,j) + u(i+1,j) &
                           - 4*u(i,j)          &
                           + u(i,j-1) + u(i,j+1)
               end do
            end do
         end do
      end do

Better use of spatial locality (cache lines).


                        2011 HPCMP User Group © Cray Inc.   June 20, 2011    227
   Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
   Operations can be arranged to create multiple levels of blocking
      Block for register
      Block for cache (L1, L2, L3)
      Block for TLB
   No further discussion here. Interested readers can see
      Any book on code optimization
             Sun’s Techniques for Optimizing Applications: High Performance Computing contains a decent introductory discussion in
              Chapter 8
             Insert your favorite book here
        Gunnels, Henry, and van de Geijn. June 2001. High-performance matrix multiplication
         algorithms for architectures with hierarchical memories. FLAME Working Note #4 TR-
         2001-22, The University of Texas at Austin, Department of Computer Sciences
             Develops algorithms and cost models for GEMM in hierarchical memories
        Goto and van de Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM
         Transactions on Mathematical Software 34, 3 (May), 1-25
             Description of GotoBLAS DGEMM
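A minimal sketch of just one level of cache blocking applied to GEMM (not the tuned algorithms referenced above; the block size nb is a placeholder you would tune so three nb × nb tiles fit in the target cache):

! Hedged sketch: one level of cache blocking for C = C + A*B on square N×N matrices.
! Real GEMMs (e.g. GotoBLAS) add register blocking and data packing on top of this.
subroutine blocked_gemm(n, nb, a, b, c)
  implicit none
  integer, intent(in)    :: n, nb
  real(8), intent(in)    :: a(n,n), b(n,n)
  real(8), intent(inout) :: c(n,n)
  integer :: ib, jb, kb, i, j, k

  do jb = 1, n, nb
     do kb = 1, n, nb
        do ib = 1, n, nb
           do j = jb, min(jb+nb-1, n)          ! work on one nb×nb tile of C
              do k = kb, min(kb+nb-1, n)
                 do i = ib, min(ib+nb-1, n)    ! stride-1 inner loop
                    c(i,j) = c(i,j) + a(i,k) * b(k,j)
                 end do
              end do
           end do
        end do
     end do
  end do
end subroutine blocked_gemm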




                                              2011 HPCMP User Group © Cray Inc.         June 20, 2011                                 228
“I tried cache-blocking my code, but it didn’t help”


 You’re doing it wrong.
    Your block size is too small (too much loop overhead).
    Your block size is too big (data is falling out of cache).
    You’re targeting the wrong cache level (?)
    You haven’t selected the correct subset of loops to block.
 The compiler is already blocking that loop.
 Prefetching is acting to minimize cache misses.
 Computational intensity within the loop nest is very large, making blocking less
  important.



                                2011 HPCMP User Group © Cray Inc.   June 20, 2011   229
 Multigrid PDE solver
 Class D, 64 MPI ranks
    Global grid is 1024 × 1024 × 1024
    Local grid is 258 × 258 × 258
 Two similar loop nests account for >50% of run time
 27-point 3D stencil
    There is good data reuse along the leading dimension, even without blocking

      do i3 = 2, 257
         do i2 = 2, 257
            do i1 = 2, 257
!              update u(i1,i2,i3)
!              using 27-point stencil
            end do
         end do
      end do

[Figure: 27-point stencil around (i1,i2,i3), with neighbours i1±1, i2±1, i3±1; cache lines run along the i1 (leading) direction]




                              2011 HPCMP User Group © Cray Inc.    June 20, 2011                   230
 Block the inner two loops
 Creates blocks extending along the i3 direction

   do I2BLOCK = 2, 257, BS2
      do I1BLOCK = 2, 257, BS1
         do i3 = 2, 257
            do i2 = I2BLOCK,                   &
                    min(I2BLOCK+BS2-1, 257)
               do i1 = I1BLOCK,                &
                       min(I1BLOCK+BS1-1, 257)
!                 update u(i1,i2,i3)
!                 using 27-point stencil
               end do
            end do
         end do
      end do
   end do

      Block size    Mop/s/process
      unblocked        531.50
      16 × 16          279.89
      22 × 22          321.26
      28 × 28          358.96
      34 × 34          385.33
      40 × 40          408.53
      46 × 46          443.94
      52 × 52          468.58
      58 × 58          470.32
      64 × 64          512.03
      70 × 70          506.92
                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011               Slide 231
 Block the outer two loops
 Preserves spatial locality along the i1 direction

   do I3BLOCK = 2, 257, BS3
      do I2BLOCK = 2, 257, BS2
         do i3 = I3BLOCK,                   &
                 min(I3BLOCK+BS3-1, 257)
            do i2 = I2BLOCK,                &
                    min(I2BLOCK+BS2-1, 257)
               do i1 = 2, 257
!                 update u(i1,i2,i3)
!                 using 27-point stencil
               end do
            end do
         end do
      end do
   end do

      Block size    Mop/s/process
      unblocked        531.50
      16 × 16          674.76
      22 × 22          680.16
      28 × 28          688.64
      34 × 34          683.84
      40 × 40          698.47
      46 × 46          689.14
      52 × 52          706.62
      58 × 58          692.57
      64 × 64          703.40
      70 × 70          693.87
                                    2011 HPCMP User Group © Cray Inc.   June 20, 2011               Slide 232
2011 HPCMP User Group © Cray Inc.   June 20, 2011   233
(   53) void mat_mul_daxpy(double *a, double *b, double *c, int rowa,
    int cola, int colb)
(   54) {
(   55)     int i, j, k;               /* loop counters */
(   56)     int rowc, colc, rowb;      /* sizes not passed as arguments */
(   57)     double con;                /* constant value */
(   58)
(   59)     rowb = cola;
(   60)     rowc = rowa;
(   61)     colc = colb;
(   62)
(   63)     for(i=0;i<rowc;i++) {
(   64)         for(k=0;k<cola;k++) {
(   65)             con = *(a + i*cola +k);
(   66)             for(j=0;j<colc;j++) {
(   67)                 *(c + i*colc + j) += con * *(b + k*colb + j);
(   68)             }
(   69)         }
(   70)     }
(   71) }

mat_mul_daxpy:
    66, Loop not vectorized: data dependency
          Loop not vectorized: data dependency
          Loop unrolled 4 times

C pointers don’t carry the same rules as Fortran arrays:
 The compiler has no way to know whether *a, *b, and *c overlap or are referenced differently elsewhere.
 It must assume the worst, resulting in a false data dependency.
                                          2011 HPCMP User Group © Cray Inc.   June 20, 2011                       Slide 234
(   53) void mat_mul_daxpy(double* restrict a, double* restrict b,
    double* restrict c, int rowa, int cola, int colb)
(   54) {
(   55)     int i, j, k;               /* loop counters */
(   56)     int rowc, colc, rowb;      /* sizes not passed as arguments */
(   57)     double con;                /* constant value */
(   58)
(   59)     rowb = cola;
(   60)     rowc = rowa;
(   61)     colc = colb;
(   62)
(   63)     for(i=0;i<rowc;i++) {
(   64)         for(k=0;k<cola;k++) {
(   65)             con = *(a + i*cola +k);
(   66)             for(j=0;j<colc;j++) {
(   67)                 *(c + i*colc + j) += con * *(b + k*colb + j);
(   68)             }
(   69)         }
(   70)     }
(   71) }

C pointers, restricted:
 C99 introduces the restrict keyword, which lets the programmer promise not to reference the memory through another pointer.
 If you declare a restricted pointer and break that promise, the behavior is undefined by the standard.



                                          2011 HPCMP User Group © Cray Inc.   June 20, 2011                      Slide 235
66, Generated alternate loop with no peeling - executed if loop count <= 24
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
       Generated alternate loop with no peeling and more aligned moves -
  executed if loop count <= 24 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
       Generated alternate loop with more aligned moves - executed if loop
  count >= 25 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop


• This can also be achieved with the PGI safe pragma or –Msafeptr
  compiler option or Pathscale –OPT:alias option

                           2011 HPCMP User Group © Cray Inc.   June 20, 2011   Slide 236
2011 HPCMP User Group © Cray Inc.   June 20, 2011   Slide 237
 GNU malloc library
   malloc, calloc, realloc, free calls
      Fortran dynamic variables
 Malloc library system calls
   mmap, munmap => used for larger allocations
   brk, sbrk => grow/shrink the heap
 The malloc library is optimized for low system memory use
   This can result in extra system calls and minor page faults




                       2011 HPCMP User Group © Cray Inc.   June 20, 2011   238
 Detecting “bad” malloc behavior
   Profile data => “excessive system time”
 Correcting “bad” malloc behavior
   Eliminate mmap use by malloc
   Increase threshold to release heap memory
 Use environment variables to alter malloc
   MALLOC_MMAP_MAX_ = 0
   MALLOC_TRIM_THRESHOLD_ = 536870912
 Possible downsides
    Heap fragmentation
    User process may call mmap directly
    User process may launch other processes
 PGI’s –Msmartalloc does something similar for you at compile time
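For example, a minimal sketch for a batch/run script (bash syntax; the values are the ones suggested above):

export MALLOC_MMAP_MAX_=0                 # never satisfy malloc with mmap
export MALLOC_TRIM_THRESHOLD_=536870912   # only release heap back to the OS above 512 MB of free space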




                         2011 HPCMP User Group © Cray Inc.   June 20, 2011   239
 Google created a replacement “malloc” library
   “Minimal” TCMalloc replaces GNU malloc
 Limited testing indicates TCMalloc as good or better
  than GNU malloc
    Environment variables not required
    TCMalloc almost certainly better for allocations in
     OpenMP parallel regions
 There’s currently no pre-built tcmalloc for Cray XT/XE,
  but some users have successfully built it.



                     2011 HPCMP User Group © Cray Inc.   June 20, 2011   240
 Linux has a “first touch policy” for memory allocation
    *alloc functions don’t actually allocate your memory
    Memory gets allocated when “touched”
 Problem: a code can allocate more memory than is available
    Linux assumes swap space is available; Cray compute nodes have none
    Applications won’t fail from over-allocation until the memory is actually
     touched
 Problem: memory is placed on the NUMA node of the “touching” thread
    Only a problem if thread 0 allocates all memory for a node
 Solution: Always initialize your memory immediately after allocating it
    If you over-allocate, it will fail immediately rather than at a strange place
     in your code
    If every thread touches its own memory, it will be allocated on the
     proper socket
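A minimal sketch (array name and size are placeholders) of initializing memory in parallel right after allocating it, so each thread first-touches, and therefore places, its own pages:

! Hedged sketch of OpenMP first-touch initialization.
program first_touch
  implicit none
  integer, parameter :: n = 10000000
  real(8), allocatable :: a(:)
  integer :: i

  allocate(a(n))        ! no physical pages are assigned yet

!$omp parallel do private(i)
  do i = 1, n
     a(i) = 0.0d0       ! each thread touches (and therefore places) its own pages
  end do
!$omp end parallel do
end program first_touch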


                             2011 HPCMP User Group © Cray Inc.   June 20, 2011     Slide 241
This may help both compute and communication.




             2011 HPCMP User Group © Cray Inc.   June 20, 2011   242
 Opterons support 4K, 2M, and 1G pages
   We don’t support 1G pages
   4K pages are used by default
 2M pages are more difficult to use, but…


 Your code may run with fewer TLB misses (hence faster).
   The TLB can address more physical memory with 2M pages than
     with 4K pages
 The Gemini interconnect performs better with 2M pages than with 4K pages.
   2M pages consume fewer Gemini resources than 4K pages.




                       2011 HPCMP User Group © Cray Inc.   June 20, 2011   243
 Link the hugetlbfs library into your code: ‘-lhugetlbfs’
 Set the HUGETLB_MORECORE environment variable in your run script.
    Example: export HUGETLB_MORECORE=yes
 Use the aprun option -m###h to request ### MB of huge pages.
    Example: aprun -m500h  (request 500 MB of huge pages as available;
     use 4K pages thereafter)
    Example: aprun -m500hs (request 500 MB of huge pages; if not
     available, terminate the launch)
 Note: If not enough HUGE pages are available, the cost of
  filling the remaining with 4K pages may degrade performance.


                        2011 HPCMP User Group © Cray Inc.   June 20, 2011   244
2011 HPCMP User Group © Cray Inc.   June 20, 2011   245
 Short Message Eager Protocol

    The sending rank “pushes” the message to the receiving rank
    Used for messages MPICH_MAX_SHORT_MSG_SIZE bytes or less
    Sender assumes that receiver can handle the message
       Matching receive is posted - or -
       Has available event queue entries (MPICH_PTL_UNEX_EVENTS) and
         buffer space (MPICH_UNEX_BUFFER_SIZE) to store the message

 Long Message Rendezvous Protocol

    Messages are “pulled” by the receiving rank
    Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
    Sender sends small header packet with information for the receiver to pull
     over the data
    Data is sent only after matching receive is posted by receiving rank
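A minimal sketch (not from the slides; the two-rank pattern, buffer size, and tag are invented) of pre-posting the receive so an incoming message finds a matching receive instead of landing in the unexpected buffers:

! Hedged sketch: post the receive before the matching send arrives.
program prepost_recv
  use mpi
  implicit none
  integer, parameter :: n = 100000
  real(8) :: sendbuf(n), recvbuf(n)
  integer :: ierr, rank, req, other
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  other = 1 - rank                      ! assumes exactly 2 ranks
  sendbuf = real(rank, 8)

  call MPI_Irecv(recvbuf, n, MPI_REAL8, other, 0, MPI_COMM_WORLD, req, ierr)   ! post the receive first
  call MPI_Send (sendbuf, n, MPI_REAL8, other, 0, MPI_COMM_WORLD, ierr)
  call MPI_Wait (req, status, ierr)

  call MPI_Finalize(ierr)
end program prepost_recv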




                              2011 HPCMP User Group © Cray Inc.   June 20, 2011   246
Message path when the MPI_RECV is posted prior to the MPI_SEND call:
 STEP 1: the receiver (rank 1) calls MPI_RECV and posts a match entry (ME) to Portals on the SeaStar.
 STEP 2: the sender (rank 0) calls MPI_SEND.
 STEP 3: a Portals DMA PUT delivers the data directly into the posted application buffer.

[Figure: besides the application ME, MPI posts match entries to handle unexpected messages – an eager short-message ME and a rendezvous long-message ME – backed by the MPI unexpected buffers (MPICH_UNEX_BUFFER_SIZE), the unexpected message queue, the other event queue (MPICH_PTL_OTHER_EVENTS), and the unexpected event queue (MPICH_PTL_UNEX_EVENTS).]
                            2011 HPCMP User Group © Cray Inc.    June 20, 2011                             247
MPT eager protocol on SeaStar: data is “pushed” to the receiver
(messages of MPICH_MAX_SHORT_MSG_SIZE bytes or less); MPI_RECV is not posted prior to the MPI_SEND call.
 STEP 1: the sender (rank 0) calls MPI_SEND.
 STEP 2: a Portals DMA PUT delivers the data into the MPI unexpected buffers (MPICH_UNEX_BUFFER_SIZE), with entries recorded in the unexpected message queue and unexpected event queue (MPICH_PTL_UNEX_EVENTS).
 STEP 3: the receiver (rank 1) calls MPI_RECV; no Portals ME is needed.
 STEP 4: the data is copied (memcpy) from the unexpected buffer into the application buffer.
                                     2011 HPCMP User Group © Cray Inc.   June 20, 2011                           248
MPT rendezvous protocol on SeaStar: data is not sent until the MPI_RECV is issued.
 STEP 1: the sender (rank 0) calls MPI_SEND and a Portals ME is created for the send buffer (App ME).
 STEP 2: a Portals DMA PUT sends a small header packet to the receiver.
 STEP 3: the receiver (rank 1) calls MPI_RECV, which triggers the GET request.
 STEP 4: the receiver issues a GET request that matches the sender’s ME.
 STEP 5: a Portals DMA transfers the data into the receive buffer.
                                 2011 HPCMP User Group © Cray Inc.       June 20, 2011                           249
2011 HPCMP User Group © Cray Inc.   June 20, 2011   250
 The default ordering can be changed using the following
  environment variable:
       MPICH_RANK_REORDER_METHOD
 These are the different values that you can set it to:
       0: Round-robin placement – Sequential ranks are placed on the next node in the
          list. Placement starts over with the first node upon reaching the end of the list.
       1: SMP-style placement – Sequential ranks fill up each node before moving to the
          next.
       2: Folded rank placement – Similar to round-robin placement except that each pass
          over the node list is in the opposite direction of the previous pass.
       3: Custom ordering. The ordering is specified in a file named
          MPICH_RANK_ORDER.
 When is this useful?
    Point-to-point communication consumes a significant fraction of program time and a
     load imbalance is detected
    Also shown to help for collectives (alltoall) on sub-communicators (GYRO)
    To spread I/O across nodes (POP)

                                2011 HPCMP User Group © Cray Inc.   June 20, 2011        251
 One can also use the CrayPat performance measurement tools to generate a
  suggested custom ordering.
    Available if MPI functions are traced (-g mpi or -O apa)
    pat_build -O apa my_program
       see Examples section of pat_build man page
 pat_report options:
    mpi_sm_rank_order
       Uses message data from tracing MPI to generate suggested MPI rank
        order. Requires the program to be instrumented using the pat_build -g
        mpi option.
    mpi_rank_order
       Uses time in user functions, or alternatively, any other metric specified
        by using the -s mro_metric options, to generate suggested MPI rank
        order.

                              2011 HPCMP User Group © Cray Inc.   June 20, 2011     252
 module load xt-craypat
 Rebuild your code
 pat_build -O apa a.out
 Run a.out+pat
 pat_report -O mpi_sm_rank_order a.out+pat+…sdt/ > pat.report
 This creates an MPICH_RANK_ORDER.x file
 Then set the env var MPICH_RANK_REORDER_METHOD=3 and
  link the file MPICH_RANK_ORDER.x to MPICH_RANK_ORDER
 Rerun the code




                           2011 HPCMP User Group © Cray Inc.   June 20, 2011   253
Table 1:   Suggested MPI Rank Order

Eight cores per node: USER Samp per node

 Rank    Max          Max/     Avg          Avg/      Max Node
 Order   USER Samp    SMP      USER Samp    SMP       Ranks
    d        17062    97.6%        16907    100.0%    832,328,820,797,113,478,898,600
    2        17213    98.4%        16907    100.0%    53,202,309,458,565,714,821,970
    0        17282    98.8%        16907    100.0%    53,181,309,437,565,693,821,949
    1        17489   100.0%        16907    100.0%    0,1,2,3,4,5,6,7

This suggests that:
 1. The custom ordering “d” might be the best
 2. Folded-rank is next best
 3. Round-robin is third best
 4. The default ordering is last




                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011          254
 GYRO 8.0
    B3-GTC problem with 1024 processes
 Run with alternate MPI orderings
    Custom: profiled with -O apa and used reordering file MPICH_RANK_ORDER.d

      Reorder method          Comm. time
      Default                   11.26 s
      0 – round-robin            6.94 s
      2 – folded-rank            6.68 s
      d – custom from apa        8.03 s

 The CrayPat suggestion was almost right!

                             2011 HPCMP User Group © Cray Inc.   June 20, 2011                   255
 TGYRO 1.0
    Steady-state turbulent transport code using GYRO, NEO, TGLF components
 ASTRA test case
    Tested MPI orderings at large scale
    Originally testing weak scaling, but found reordering very useful

      Reorder method     TGYRO wall time (min)
                         20480    40960    81920
      Default              99m     104m     105m
      Round-robin          66m      63m      72m

 A huge win!

                              2011 HPCMP User Group © Cray Inc.   June 20, 2011          256
2011 HPCMP User Group © Cray Inc.   June 20, 2011   257
Time % |        Time |   Imb. Time |   Imb.   |         Calls |Experiment=1
       |             |             | Time %   |               |Group
       |             |             |          |               | Function
       |             |             |          |               | PE='HIDE'

 100.0% | 1530.892958 |         -- |     -- | 27414118.0 |Total
|---------------------------------------------------------------------
| 52.0% | 796.046937 |                -- |      -- | 22403802.0 |USER
||--------------------------------------------------------------------
|| 22.3% | 341.176468 |     3.482338 |   1.0% | 19200000.0 |getrates_
|| 17.4% | 266.542501 | 35.451437 | 11.7% |         1200.0 |rhsf_
||   5.1% |   78.772615 |   0.532703 |   0.7% | 3200000.0 |mcavis_new_looptool_
||   2.6% |   40.477488 |   2.889609 |   6.7% |     1200.0 |diffflux_proc_looptool_
||   2.1% |   31.666938 |   6.785575 | 17.6% |       200.0 |integrate_erk_jstage_lt_
||   1.4% |   21.318895 |   5.042270 | 19.1% |      1200.0 |computeheatflux_looptool_
||   1.1% |   16.091956 |   6.863891 | 29.9% |         1.0 |main
||====================================================================
| 47.4% | 725.049709 |                -- |      -- | 5006632.0 |MPI
||--------------------------------------------------------------------
|| 43.8% | 670.742304 | 83.143600 | 11.0% | 2389440.0 |mpi_wait_
||   1.9% |   28.821882 | 281.694997 | 90.7% | 1284320.0 |mpi_isend_
|=====================================================================




                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011     258
Time % |        Time |   Imb. Time |   Imb.   |         Calls |Experiment=1
       |             |             | Time %   |               |Group
       |             |             |          |               | Function
       |             |             |          |               | PE='HIDE'

 100.0% | 1730.555208 |         -- |     -- | 16090113.8 |Total
|---------------------------------------------------------------------
|   76.9% | 1330.111350 |             -- |           -- |     4882627.8 |MPI
||--------------------------------------------------------------------
|| 72.1% | 1247.436960 | 54.277263 |     4.2% | 2389440.0 |mpi_wait_
||   1.3% |   22.712017 | 101.212360 | 81.7% | 1234718.3 |mpi_isend_
||   1.0% |   17.623757 |   4.642004 | 20.9% |         1.0 |mpi_comm_dup_
||   1.0% |   16.849281 | 71.805979 | 81.0% | 1234718.3 |mpi_irecv_
||   1.0% |   16.835691 | 192.820387 | 92.0% |     19999.2 |mpi_waitall_
||====================================================================
| 22.2% | 384.978417 |           -- |     -- | 11203802.0 |USER
||--------------------------------------------------------------------
||   9.9% | 171.440025 |    1.929439 |   1.1% | 9600000.0 |getrates_
||   7.7% | 133.599580 | 19.572807 | 12.8% |        1200.0 |rhsf_
||   2.3% |   39.465572 |   0.600168 |   1.5% | 1600000.0 |mcavis_new_looptool_
|=====================================================================
|=====================================================================




                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011   259
Nearest-neighbour communication partners for the derivative computations:
 Differencing in the X direction: MPI tasks K-1, K, K+1
 Differencing in the Y direction: MPI tasks K-30, K, K+30
 Differencing in the Z direction: MPI tasks K-1200, K, K+1200
                               2011 HPCMP User Group © Cray Inc.   June 20, 2011                 260
Code must perform one communication across each surface of a cube
12 cubes perform 72 communications, 63 of which go “off node”


           Optimized mapping of the MPI tasks on the node
           Still performs 72 communications, but now only 32 are off node




                            2011 HPCMP User Group © Cray Inc.   June 20, 2011   261
Rank Reordering Case Study

Application data is in a 3D space, X × Y × Z.

                                                    Communication is
                                                    nearest-neighbor.

                                                    Default ordering
                                                    results in 12x1x1
                                                    block on each node.

                                                    A custom reordering
                                                    is now generated:
                                                    3x2x2 blocks per
                                                    node, resulting in
                                                    more on-node
                                                    communication




2011 HPCMP User Group © Cray Inc.   June 20, 2011                            262
% pat_report -O mpi_sm_rank_order -s rank_grid_dim=8,6 ...

Notes for table 1:

    To maximize the locality of point to point communication,
    specify a Rank Order with small Max and Avg Sent Msg Total Bytes
    per node for the target number of cores per node.

    To specify a Rank Order with a numerical value, set the environment
    variable MPICH_RANK_REORDER_METHOD to the given value.

    To specify a Rank Order with a letter value 'x', set the environment
    variable MPICH_RANK_REORDER_METHOD to 3, and copy or link the file
    MPICH_RANK_ORDER.x to MPICH_RANK_ORDER.

Table 1:   Sent Message Stats and Suggested MPI Rank Order

                 Communication Partner Counts

        Number    Rank
      Partners   Count      Ranks

             2        4     0   5    42     47
             3       20     1   2     3      4    ...
             4       24     7   8     9     10    ...
                                2011 HPCMP User Group © Cray Inc.   June 20, 2011   Slide 263
Four cores per node:       Sent Msg Total Bytes per node

  Rank          Max    Max/           Avg    Avg/   Max Node
 Order  Total Bytes     SMP   Total Bytes     SMP   Ranks

     g    121651200   73.9%     86400000   62.5%   14,20,15,21
     h    121651200   73.9%     86400000   62.5%   14,20,21,15
     u    152064000   92.4%    146534400  106.0%   13,12,10,4
     1    164505600  100.0%    138240000  100.0%   16,17,18,19
     d    164505600  100.0%    142387200  103.0%   16,17,19,18
     0    224640000  136.6%    207360000  150.0%   1,13,25,37
     2    241920000  147.1%    207360000  150.0%   7,16,31,40




                          2011 HPCMP User Group © Cray Inc.    June 20, 2011                 Slide 264
% $CRAYPAT_ROOT/sbin/grid_order -c 2,2 -g 8,6

# grid_order -c 2,2 -g 8,6
# Region 0: 0,0 (0..47)
0,1,6,7
2,3,8,9
4,5,10,11
12,13,18,19
14,15,20,21
16,17,22,23
24,25,30,31
26,27,32,33
28,29,34,35
36,37,42,43
38,39,44,45
40,41,46,47

This script will also handle the case that cells do not
evenly partition the grid.

                    2011 HPCMP User Group © Cray Inc.   June 20, 2011   Slide 265
 X   X   o    o
 X   X   o    o
 o   o   o    o
 o   o   o    o



 Nodes marked X heavily use a shared resource
 If that resource is memory bandwidth, scatter the X's
 If it is network bandwidth to other nodes, again scatter
 If it is network bandwidth among the X's themselves, concentrate them




                              2011 HPCMP User Group © Cray Inc.   June 20, 2011   Slide 267
2011 HPCMP User Group © Cray Inc.   June 20, 2011   268
Call mpi_send(a, 10, …)
Call mpi_send(b, 10, …)         Each message incurs latency and library overhead
Call mpi_send(c, 10, …)
Call mpi_send(d, 10, …)


 Copy messages into a contiguous buffer and send once

Sendbuf(1:10) = a(1:10)
Sendbuf(11:20) = b(1:10)
Sendbuf(21:30) = c(1:10)
Sendbuf(31:40) = d(1:10)

Call mpi_send(sendbuf, 40, …)     Latency and library overhead
                                  incurred only once

 Effectiveness of this optimization is machine dependent
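
For illustration, a minimal C sketch of the same aggregation idea (the array names, counts, and tag are illustrative, not taken from the Fortran fragment above):

#include <mpi.h>
#include <string.h>

/* Pack four small 10-element arrays into one contiguous buffer and send it
   once, so the latency and library overhead are paid only once. */
void send_aggregated(const double *a, const double *b,
                     const double *c, const double *d,
                     int dest, int tag, MPI_Comm comm)
{
    double sendbuf[40];

    memcpy(&sendbuf[0],  a, 10 * sizeof(double));
    memcpy(&sendbuf[10], b, 10 * sizeof(double));
    memcpy(&sendbuf[20], c, 10 * sizeof(double));
    memcpy(&sendbuf[30], d, 10 * sizeof(double));

    MPI_Send(sendbuf, 40, MPI_DOUBLE, dest, tag, comm);
}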

                                2011 HPCMP User Group © Cray Inc.   June 20, 2011   269
 Most collectives have been tuned to take advantage of
  algorithms and hardware to maximize performance
    MPI_ALLTOALL
       Reorder communications to spread traffic around the network efficiently
   MPI_BCAST/_REDUCE/_ALLREDUCE
       Use tree-based algorithms to reduce the number of messages.
       Needs to strike a balance between width and depth of the tree.
   MPI_GATHER
       Use a tree algorithm to reduce resource contention and aggregate messages.



 You don’t want to have to reinvent the wheel
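
As a minimal illustration (a sketch, not from the slides): a global sum is one call to the tuned collective, with no need to hand-code a reduction tree.

#include <mpi.h>

/* Global sum of a local value across all ranks; every rank gets the result.
   The library chooses the tree shape and algorithm for the machine. */
double global_sum(double local, MPI_Comm comm)
{
    double total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
    return total;
}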




                               2011 HPCMP User Group © Cray Inc.   June 20, 2011   270
 MPI_ALLTOALL
    Message size decreases as number of ranks grows
    Number of messages is O(num_ranks²)
    Very difficult to scale to very high core counts
 MPI_BCAST/_REDUCE/_ALLREDUCE/_BARRIER
    All are O(log (num_ranks))
    All represent global sync points
    Expose ANY load imbalance in the code
    Expose ANY “jitter” induced by the OS or other services
 MPI_GATHER
    Many-to-one


 The greater the frequency of collectives, the harder it will be to scale
                           2011 HPCMP User Group © Cray Inc.   June 20, 2011   271
2011 HPCMP User Group © Cray Inc.   June 20, 2011   272
Filesystem
 Lustre, GPFS, and Panasas are “parallel filesystems”
 I/O operations are broken down to basic units and distributed to multiple
  endpoints
 Spreading out operations in this way can greatly improve performance at large
  processor counts

Program
 Just as a problem gets partitioned to multiple processors, I/O operations can
  be done in parallel
 MPI-IO is a standard API for doing parallel I/O operations
 By performing I/O operations in parallel, an application can reduce I/O
  bottlenecks and take advantage of parallel filesystems
 HDF5, NetCDF, and ADIOS all provide parallel I/O in a portable file format

                           2011 HPCMP User Group © Cray Inc.   June 20, 2011          273
 To maximize I/O performance, parallel filesystems
    Break I/O operations into chunks, much like inodes on standard filesystems,
     which get distributed among I/O servers
    Provide a means of controlling how much concurrency to use for a given file
    Make the distributed nature of the data invisible to the program/programmer
 File metadata may be distributed (GPFS) or centralized (Lustre)
 In order to take advantage of a parallel filesystem, a user must
    Ensure that multiple processes are sharing I/O duties; one process is incapable
     of saturating the filesystem
    Prevent multiple processes from using the same “chunk” simultaneously
     (more important with writes)
    Choose a concurrency that is “distributed enough” without spreading data too
     thin to be effective (ideally, 1 process shouldn’t need to access several I/O
     servers)


                              2011 HPCMP User Group © Cray Inc.   June 20, 2011   274
 I/O is simply data migration: Memory <-> Disk
  I/O is a very expensive operation.
     Interactions with data in memory and on disk.
     Must get the kernel involved
  How is I/O performed?
     I/O Pattern
         Number of processes and files.
         File access characteristics.
  Where is I/O performed?
     Characteristics of the computational system.
     Characteristics of the file system.

275                          2011 HPCMP User Group © Cray Inc.   June 20, 2011
 There is no “One Size Fits All” solution to the I/O
   problem.
  Many I/O patterns work well for some range of
   parameters.
  Bottlenecks in performance can occur in many
   locations. (Application and/or File system)
  Going to extremes with an I/O pattern will
   typically lead to problems.




276                           2011 HPCMP User Group © Cray Inc.   June 20, 2011
 The best performance comes from situations when the data is accessed
  contiguously in memory and on disk.
    Facilitates large operations and minimizes latency.

   [Diagram: data layout in Memory vs. on Disk]

 Commonly, data access is contiguous in memory but noncontiguous on disk,
  or vice versa, usually because a global data structure is being reconstructed
  via parallel I/O.

   [Diagram: data layout in Memory vs. on Disk]
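
As a hedged illustration of reconstructing a global structure from distributed blocks (this uses MPI-IO, covered later; the sizes and file name are illustrative):

#include <mpi.h>

/* Each rank owns one lnx-by-lny block of a global nx-by-ny array of doubles
   and writes it into the matching region of a single shared file, so the file
   holds the global structure even though each rank's piece lands in
   noncontiguous regions of the file. */
void write_block(MPI_Comm comm, double *block,
                 int nx, int ny,        /* global array size      */
                 int lnx, int lny,      /* local block size       */
                 int x0, int y0)        /* block origin in global */
{
    int gsizes[2] = { nx, ny };
    int lsizes[2] = { lnx, lny };
    int starts[2] = { x0, y0 };
    MPI_Datatype filetype;
    MPI_File fh;

    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(comm, "global.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, block, lnx * lny, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}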




277                                   2011 HPCMP User Group © Cray Inc.
                                                                      June 20, 2011
 Spokesperson
       One process performs I/O.
          Data Aggregation or Duplication
          Limited by single I/O process.
       Pattern does not scale.
          Time increases linearly with amount of data.
          Time increases with number of processes.
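
For illustration, a minimal C sketch of the spokesperson pattern (the file name and per-rank count are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Rank 0 gathers every rank's contribution and writes the whole file itself.
   Simple, but both the gather and the write serialize on rank 0. */
void spokesperson_write(double *local, int n, MPI_Comm comm)
{
    int rank, nranks;
    double *all = NULL;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    if (rank == 0)
        all = malloc((size_t)n * nranks * sizeof(double));

    MPI_Gather(local, n, MPI_DOUBLE, all, n, MPI_DOUBLE, 0, comm);

    if (rank == 0) {
        FILE *fp = fopen("output.dat", "wb");
        fwrite(all, sizeof(double), (size_t)n * nranks, fp);
        fclose(fp);
        free(all);
    }
}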



278                           2011 HPCMP User Group © Cray Inc.
                                                              June 20, 2011
 File per process
       All processes perform I/O to individual files.
          Limited by file system.
       Pattern does not scale at large process counts.
          Number of files creates bottleneck with metadata operations.
          Number of simultaneous disk accesses creates contention for
           file system resources.
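
For illustration, a minimal C sketch of the file-per-process pattern (the file-name template is illustrative):

#include <mpi.h>
#include <stdio.h>

/* Every rank writes its own file.  Simple and fast at modest scale, but the
   flood of files becomes a metadata bottleneck at large process counts. */
void file_per_process_write(double *local, int n, MPI_Comm comm)
{
    int rank;
    char name[64];
    FILE *fp;

    MPI_Comm_rank(comm, &rank);
    snprintf(name, sizeof(name), "output.%06d", rank);

    fp = fopen(name, "wb");
    fwrite(local, sizeof(double), (size_t)n, fp);
    fclose(fp);
}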




279                                2011 HPCMP User Group © Cray Inc.
                                                                   June 20, 2011
 Shared File
       Each process performs I/O to a single file which is shared.
       Performance
          Data layout within the shared file is very important.
          At large process counts contention can build for file system
           resources.



280                           2011 HPCMP User Group © Cray Inc.
                                                              June 20, 2011
 Subset of processes which perform I/O.
       Aggregation of a group of processes’ data.
           Serializes I/O in group.
       I/O process may access independent files.
           Limits the number of files accessed.
       Group of processes perform parallel I/O to a shared file.
           Increases the number of shared files to increase file system usage.
           Decreases number of processes which access a shared file to
            decrease file system contention.
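
A hedged sketch of this aggregation approach, assuming a fixed number of ranks per I/O group (the group size and file naming are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RANKS_PER_GROUP 32   /* illustrative aggregation factor */

/* Split the ranks into groups; each group gathers its data onto one
   "I/O rank", which writes one file per group.  This caps both the number
   of files and the number of processes touching the filesystem. */
void aggregated_write(double *local, int n, MPI_Comm comm)
{
    int rank, color, grank, gsize;
    MPI_Comm group;
    double *all = NULL;

    MPI_Comm_rank(comm, &rank);
    color = rank / RANKS_PER_GROUP;

    MPI_Comm_split(comm, color, rank, &group);
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    if (grank == 0)
        all = malloc((size_t)n * gsize * sizeof(double));

    MPI_Gather(local, n, MPI_DOUBLE, all, n, MPI_DOUBLE, 0, group);

    if (grank == 0) {
        char name[64];
        FILE *fp;
        snprintf(name, sizeof(name), "output.group%04d", color);
        fp = fopen(name, "wb");
        fwrite(all, sizeof(double), (size_t)n * gsize, fp);
        fclose(fp);
        free(all);
    }
    MPI_Comm_free(&group);
}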




281                                    2011 HPCMP User Group © Cray Inc.   June 20, 2011
 128 MB per file and a 32 MB Transfer size


   [Chart: "File Per Process Write Performance": Write (MB/s) vs. number of
    processes/files (0 to 9000), with curves for 1 MB and 32 MB stripe sizes]


282                        2011 HPCMP User Group © Cray Inc.   June 20, 2011
 32 MB per process, 32 MB Transfer size and Stripe size

   [Chart: "Single Shared File Write Performance": Write (MB/s) vs. number of
    processes (0 to 9000), with curves for POSIX, MPIIO, and HDF5]


283                        2011 HPCMP User Group © Cray Inc.   June 20, 2011
 Lustre
       Minimize contention for file system resources.
       A process should not access more than one or two OSTs.
   Performance
       Performance is limited for single process I/O.
       Parallel I/O utilizing a file-per-process or a single shared file is limited at
        large scales.
       Potential solution is to utilize multiple shared files or a subset of
        processes which perform I/O.




284                                   2011 HPCMP User Group © Cray Inc.   June 20, 2011
 Standard Output and Error streams are
   effectively serial I/O.
  All STDIN, STDOUT, and STDERR I/O
   serialize through aprun
  Disable debugging messages when
   running in production mode.
      “Hello, I’m task 32000!”
      “Task 64000, made it through loop.”




285                        2011 HPCMP User Group © Cray Inc.
                                                           June 20, 2011
 Advantages
      Aggregates smaller read/write
       operations into larger operations.
      Examples: OS Kernel Buffer, MPI-IO
       Collective Buffering
    Disadvantages
      Requires additional memory for the
       buffer.
      Can tend to serialize I/O.
   Caution
      Frequent buffer flushes can adversely
       affect performance.




286                               2011 HPCMP User Group © Cray Inc.
                                                                  June 20, 2011
 If an application does extremely small, irregular I/O, explicit buffering may improve
  performance.
 A post-processing application writes a 1 GB file.
       This case study is an extreme example.
 The data comes from a single writer, but in many small write operations.
       Takes 1080 s (~18 minutes) to complete.
 IOBUF was utilized to intercept these writes with 64 MB buffers.
       Takes 4.5 s to complete, a 99.6% reduction in time.

File "ssef_cn_2008052600f000"
                   Calls         Seconds                   Megabytes           Megabytes/sec     Avg Size
   Open                1        0.001119
   Read              217        0.247026                   0.105957                   0.428931       512
   Write         2083634        1.453222                1017.398927                 700.098632       512
   Close               1        0.220755
   Total         2083853        1.922122                1017.504884                 529.365466        512
   Sys Read            6        0.655251                 384.000000                 586.035160   67108864
   Sys Write          17        3.848807                1081.145508                 280.904052   66686072
   Buffers used            4 (256 MB)
   Prefetches              6
   Preflushes             15


287                                     2011 HPCMP User Group © Cray Inc.
                                                                        June 20, 2011
 Writing a big-endian binary file with the compiler
      flag byteswapio
 File “XXXXXX"
                    Calls     Megabytes                    Avg Size
      Open              1
      Write       5918150   23071.28062                            4088
      Close             1
      Total       5918152   23071.28062                            4088


  Writing a little-endian binary file
 File “XXXXXX"
                    Calls     Megabytes                    Avg Size
      Open              1
      Write           350   23071.28062                    69120000
      Close             1
      Total           352   23071.28062                    69120000


288                         2011 HPCMP User Group © Cray Inc.   June 20, 2011
 MPI-IO allows multiple MPI processes to access the same file in a distributed
    manner
   Like other MPI operations, it’s necessary to provide a data type for items being
    written to the file (may be a derived type)
   There are 3 ways to declare the “file position”
      Explicit offset: each operation explicitly declares the necessary file offset
      Individual File Pointers: Each process has its own unique handle to the file
      Shared File Pointers: The MPI library maintains 1 file pointer and determines
        how to handle parallel access (often via serialization)
   For each file position type, there are 2 “coordination” patterns
      Non-collective: Each process acts on its own behalf
      Collective: The processes coordinate, possibly allowing the library to make
        smart decisions on how to access the filesystem
   MPI-IO allows the user to provide “hints” to improve I/O performance. Often I/O
    performance can be improved via hints about the filesystem or problem-specific
    details
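
A hedged C sketch contrasting two of the file-position styles listed above, the explicit offset and the individual file pointer (the file name and layout are illustrative; a real code would use one write call or the other):

#include <mpi.h>

/* Each rank writes 'count' doubles into its own region of a shared file. */
void write_rank_block(MPI_Comm comm, double *buf, int count)
{
    int rank;
    MPI_File fh;
    MPI_Offset disp;

    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "data.out", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    disp = (MPI_Offset)rank * count * sizeof(double);

    /* Explicit offset: the file position is an argument of the call. */
    MPI_File_write_at(fh, disp, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    /* Individual file pointer: seek first, then write at the pointer
       (shown commented out as the alternative to the call above).     */
    /* MPI_File_seek(fh, disp, MPI_SEEK_SET);                          */
    /* MPI_File_write(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);  */

    MPI_File_close(&fh);
}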
                                2011 HPCMP User Group © Cray Inc.   June 20, 2011   289
int mode, ierr;
MPI_File fh;
MPI_Info info;
MPI_Status status;
/* comm (an MPI communicator), commrank (this process's rank), iosize (bytes
   written per rank), and dbuf (the data buffer) are assumed to be defined
   elsewhere. */

/* Open a file across all ranks as read/write.  Hints can be set on 'info'
   between MPI_Info_create and MPI_File_open. */
mode = MPI_MODE_CREATE|MPI_MODE_RDWR;
MPI_Info_create(&info);
MPI_File_open(comm, "output/test.dat", mode, info, &fh);

/* Set the "view" (a byte offset into the file) for each rank. */
MPI_File_set_view(fh, commrank*iosize, MPI_DOUBLE, MPI_DOUBLE, "native",
   info);

/* Collectively write from all ranks. */
MPI_File_write_all(fh, dbuf, iosize/sizeof(double), MPI_DOUBLE, &status);

/* Close the file from all ranks. */
MPI_File_close(&fh);

                             2011 HPCMP User Group © Cray Inc.   June 20, 2011          290
 Several parallel libraries are available to provide a portable, metadata-rich file
   format
 On Cray machines, it’s possible to set MPI-IO hints in your environment to improve
   out-of-the-box performance
 HDF5 (http://www.hdfgroup.org/HDF5/)
     Has long supported parallel file access
     Currently in version 1.8
 NetCDF (http://www.unidata.ucar.edu/software/netcdf/)
     Multiple parallel implementations of NetCDF exist
     Beginning with version 4.0, HDF5 is used under the hood to provide native
      support for parallel file access.
     Currently in version 4.0.
 ADIOS (http://adiosapi.org)
     Fairly young library in development by ORNL, GA Tech, and others
     Has a native file format, but also supports POSIX, NetCDF, HDF5, and other file
      formats
     Version 1.0 was released at SC09.
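
For instance, a hedged sketch of opening an HDF5 file for parallel, MPI-IO-backed access; it assumes an HDF5 build with parallel support, and the file name is illustrative:

#include <mpi.h>
#include <hdf5.h>

/* Route HDF5 I/O through MPI-IO so all ranks can access one file. */
void create_parallel_hdf5(MPI_Comm comm)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);   /* parallel file access */

    hid_t file = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... collective H5Dwrite calls would go here ... */

    H5Fclose(file);
    H5Pclose(fapl);
}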

                                 2011 HPCMP User Group © Cray Inc.   June 20, 2011      291
 Parallel Filesystems
      Minimize contention for file system resources.
      A process should not access more than one or two OSTs.
      Ideally I/O Buffers and Filesystem “Chunk” sizes should
       match evenly to avoid locking
   Performance
      Performance is limited for single process I/O.
      Parallel I/O utilizing a file-per-process or a single shared file
       is limited at large scales.
      Potential solution is to utilize multiple shared files or a
       subset of processes which perform I/O.
      Large buffers will generally perform best


292                             2011 HPCMP User Group © Cray Inc.   June 20, 2011
Load the IOBUF module:
% module load iobuf
Relink the program. Set the IOBUF_PARAMS environment variable as needed.
% setenv IOBUF_PARAMS '*:verbose'
Execute the program.

 IOBUF has a large number of options for tuning behavior from file to file.
  See man iobuf for details.
 May significantly help codes that write a lot to stdout or stderr.




                             2011 HPCMP User Group © Cray Inc.   June 20, 2011   293
 A particular code both reads and writes a 377 GB file. Runs on 6000 cores.
       Total I/O volume (reads and writes) is 850 GB.
       Utilizes parallel HDF5
   Default Stripe settings: count 4, size 1M, index -1.
       1800 s run time (~ 30 minutes)
   Stripe settings: count -1, size 1M, index -1.
       625 s run time (~ 10 minutes)
   Results
       66% decrease in run time.







294                                 2011 HPCMP User Group © Cray Inc.
                                                                    June 20, 2011
 Included in the Cray MPT library.
   Environment variables used to help MPI-IO optimize I/O performance.
       MPICH_MPIIO_CB_ALIGN environment variable (default 2).
       MPICH_MPIIO_HINTS environment variable.
       Can set striping_factor and striping_unit for files created with MPI-IO.
       If writes and/or reads utilize collective calls, collective buffering can be
        utilized (romio_cb_read/write) to approximately stripe align I/O within
        Lustre.
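
For example, a hedged C sketch of setting such hints on a file created through MPI-IO (the hint values and file name are illustrative):

#include <mpi.h>

/* Ask MPI-IO for 16 OSTs with a 1 MiB stripe, and enable collective
   buffering for writes.  Values here are illustrative, not recommendations. */
void open_with_hints(MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");       /* number of OSTs       */
    MPI_Info_set(info, "striping_unit",   "1048576");  /* stripe size in bytes */
    MPI_Info_set(info, "romio_cb_write",  "enable");   /* collective buffering */

    MPI_File_open(comm, "hinted.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
}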




295                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011
MPI-IO API, non-power-of-2 blocks and transfers, in this case blocks and
transfers both of 1M bytes and a strided access pattern. Tested on an
XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220
segments, 96 GB file

   [Chart: achieved write bandwidth in MB/Sec, y-axis 0 to 1800]
                         2011 HPCMP User Group © Cray Inc.   June 20, 2011   296
MPI-IO API, non-power-of-2 blocks and transfers, in this case blocks and
transfers both of 10K bytes and a strided access pattern. Tested on an
XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220
segments, 96 GB file

   [Chart: achieved write bandwidth in MB/Sec, y-axis 0 to 160]
                         2011 HPCMP User Group © Cray Inc.   June 20, 2011   297
On 5107 PEs, and by application design, a subset of the PEs (88) does the
writes. With collective buffering, this is further reduced to 22 aggregators
(cb_nodes) writing to 22 stripes. Tested on an XT5 with 5107 PEs, 8
cores/node

   [Chart: achieved write bandwidth in MB/Sec, y-axis 0 to 4000]
                          2011 HPCMP User Group © Cray Inc.   June 20, 2011   298
Total file size 6.4 GiB. Mesh of 64M bytes, 32M elements, with work divided
amongst all PEs. The original problem showed very poor scaling. For example,
without collective buffering, 8000 PEs take over 5 minutes to dump. Note that
disabling data sieving was necessary. Tested on an XT5, 8 stripes, 8 cb_nodes

   [Chart: dump time in Seconds (log scale, 1 to 1000) vs. number of PEs,
    with curves for w/o CB, CB=0, CB=1, and CB=2]
                            2011 HPCMP User Group © Cray Inc.   June 20, 2011            299
 Do not open a lot of files all at once (Metadata Bottleneck)
 Use a simple ls (without color) instead of ls -l (OST Bottleneck)
 Remember to stripe files
    Small, individual files => Small stripe counts
    Large, shared files => Large stripe counts
 Never set an explicit starting OST for your files (Filesystem Balance)
 Open Files as Read-Only when possible
 Limit the number of files per directory
 Stat files from just one process
 Stripe-align your I/O (Reduces Locks)
 Read small, shared files once and broadcast the data (OST Contention)
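
A hedged sketch of the last point: one rank reads the small shared file and broadcasts it, rather than having every rank read it (the helper name and error handling are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Rank 0 reads a small shared input file, then broadcasts its contents,
   so thousands of ranks do not all hit the same OST with reads. */
char *read_and_broadcast(const char *path, long *len, MPI_Comm comm)
{
    int rank;
    char *buf = NULL;

    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        FILE *fp = fopen(path, "rb");
        if (fp == NULL)
            MPI_Abort(comm, 1);
        fseek(fp, 0, SEEK_END);
        *len = ftell(fp);
        rewind(fp);
        buf = malloc((size_t)*len);
        fread(buf, 1, (size_t)*len, fp);
        fclose(fp);
    }
    MPI_Bcast(len, 1, MPI_LONG, 0, comm);
    if (rank != 0)
        buf = malloc((size_t)*len);
    MPI_Bcast(buf, (int)*len, MPI_CHAR, 0, comm);
    return buf;
}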




                             2011 HPCMP User Group © Cray Inc.   June 20, 2011   300
 Adaptable IO System (ADIOS)
    http://www.olcf.ornl.gov/center-projects/adios/
 “Optimizing MPI-IO for Applications on Cray XT System” (CrayDoc S-0013-
  10)
 “A Pragmatic Approach to Improving the Large-scale Parallel I/O
  Performance of Scientific Applications.” Crosby, et al. (CUG 2011)




                            2011 HPCMP User Group © Cray Inc.   June 20, 2011   301
2011 HPCMP User Group © Cray Inc.   June 20, 2011   302
2011 HPCMP User Group © Cray Inc.   June 20, 2011   303

HPCMPUG2011 cray tutorial

  • 2.
     Review ofXT6 Architecture  AMD Opteron  Cray Networks  Lustre Basics  Programming Environment  PGI Compiler Basics  The Cray Compiler Environment  Cray Scientific Libraries  Cray Message Passing Toolkit  Cray Performance Analysis Tools  ATP  CCM  Optimizations  CPU  Communication  I/O 2011 HPCMP User Group © Cray Inc. June 20, 2011 2
  • 3.
    AMD CPU Architecture Cray Architecture Lustre Filesystem Basics 2011 HPCMP User Group © Cray Inc. June 20, 2011 3
  • 4.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 4
  • 5.
    2003 2005 2007 2008 2009 2010 AMD AMD “Barcelona” “Shanghai” “Istanbul” “Magny-Cours” Opteron™ Opteron™ Mfg. 130nm SOI 90nm SOI 65nm SOI 45nm SOI 45nm SOI 45nm SOI Process K8 K8 Greyhound Greyhound+ Greyhound+ Greyhound+ CPU Core L2/L3 1MB/0 1MB/0 512kB/2MB 512kB/6MB 512kB/6MB 512kB/12MB Hyper Transport™ 3x 1.6GT/.s 3x 1.6GT/.s 3x 2GT/s 3x 4.0GT/s 3x 4.8GT/s 4x 6.4GT/s Technology Memory 2x DDR1 300 2x DDR1 400 2x DDR2 667 2x DDR2 800 2x DDR2 800 4x DDR3 1333 2011 HPCMP User Group © Cray Inc. June 20, 2011 5
  • 6.
    12 cores 1.7-2.2Ghz 1 4 7 10 105.6Gflops 8 cores 5 11 1.8-2.4Ghz 2 8 76.8Gflops Power (ACP) 3 6 9 12 80Watts Stream 27.5GB/s Cache 12x 64KB L1 12x 512KB L2 12MB L3 2011 HPCMP User Group © Cray Inc. June 20, 2011 6
  • 7.
    L3 cache HT Link HT Link HT Link HT Link L2 cache L2 cache L2 cache L2 cache MEMORY CONTROLLER Core 2 MEMORY CONTROLLER Core 5 Core 8 Core 11 HT Link HT Link HT Link HT Link L2 cache L2 cache L2 cache L2 cache Core 1 Core 4 Core 7 Core 10 L2 cache L2 cache L2 cache L2 cache Core 0 Core 3 Core 6 Core 9 2011 HPCMP User Group © Cray Inc. June 20, 2011 7
  • 8.
     A cacheline is 64B  Unique L1 and L2 cache attached to each core  L1 cache is 64 kbytes  L2 cache is 512 kbytes  L3 Cache is shared between 6 cores  Cache is a “victim cache”  All loads go to L1 immediately and get evicted down the caches  Hardware prefetcher detects forward and backward strides through memory  Each core can perform a 128b add and 128b multiply per clock cycle  This requires SSE, packed instructions  “Stride-one vectorization”  6 cores share a “flat” memory  Non-uniform-memory-access (NUMA) beyond a node 2011 HPCMP User Group © Cray Inc. June 20, 2011 8
  • 9.
    Processor Frequency Peak Bandwidth Balance (Gflops) (GB/sec) (bytes/flop ) Istanbul 2.6 62.4 12.8 0.21 (XT5) 2.0 64 42.6 0.67 MC-8 2.3 73.6 42.6 0.58 2.4 76.8 42.6 0.55 1.9 91.2 42.6 0.47 MC-12 2.1 100.8 42.6 0.42 2.2 105.6 42.6 0.40 2011 HPCMP User Group © Cray Inc. June 20, 2011 9
  • 10.
    Gemini (XE-series) 2011 HPCMP User Group © Cray Inc. June 20, 2011 10
  • 11.
     Microkernel onCompute PEs, full featured Linux on Service PEs.  Service PEs specialize by function Compute PE  Software Architecture Login PE eliminates OS “Jitter” Network PE  Software Architecture enables reproducible run times System PE  Large machines boot in under I/O PE 30 minutes, including filesystem Service Partition Specialized Linux nodes 2011 HPCMP User Group © Cray Inc. June 20, 2011 11
  • 12.
    XE6 System External Login Server Boot RAID 10 GbE IB QDR 2011 HPCMP User Group © Cray Inc. June 20, 2011 13
  • 13.
    6.4 GB/sec directconnect Characteristics HyperTransport Number of 16 or 24 (MC) Cores 32 (IL) Peak 153 Gflops/sec Performance MC-8 (2.4) Peak 211 Gflops/sec Performance MC-12 (2.2) Memory Size 32 or 64 GB per node Memory 83.5 GB/sec Bandwidth 83.5 GB/sec direct connect memory Cray SeaStar2+ Interconnect 2011 HPCMP User Group © Cray Inc. June 20, 2011 14
  • 14.
    Greyhound Greyhound Greyhound Greyhound DDR3 Channel DDR3 Channel 6MB L3 HT3 6MB L3 Greyhound Greyhound Cache Greyhound Cache Greyhound Greyhound Greyhound DDR3 Channel Greyhound Greyhound DDR3 Channel HT3 HT3 Greyhound H Greyhound DDR3 Channel 6MB L3 Greyhound Greyhound T3 6MB L3 Greyhound Greyhound DDR3 Channel Cache Greyhound Cache Greyhound Greyhound Greyhound Greyhound HT3 Greyhound DDR3 Channel DDR3 Channel To Interconnect HT1 / HT3  2 Multi-Chip Modules, 4 Opteron Dies  8 Channels of DDR3 Bandwidth to 8 DIMMs  24 (or 16) Computational Cores, 24 MB of L3 cache  Dies are fully connected with HT3  Snoop Filter Feature Allows 4 Die SMP to scale well 2011 HPCMP User Group © Cray Inc. June 20, 2011 15
  • 15.
    Without snoop filter,a streams test shows 25MB/sec out of a possible 51.2 GB/sec or 48% of peak bandwidth 2011 HPCMP User Group © Cray Inc. June 20, 2011 16
  • 16.
    With snoop filter,a streams test shows 42.3 MB/sec out of a possible 51.2 GB/sec or 82% of peak bandwidth This feature will be key for two- socket Magny Cours Nodes which are the same architecture-wise 2011 HPCMP User Group © Cray Inc. June 20, 2011 17
  • 17.
     New computeblade with 8 AMD Magny Cours processors  Plug-compatible with XT5 cabinets and backplanes  Upgradeable to AMD’s “Interlagos” series  XE6 systems ship with the current SIO blade 2011 HPCMP User Group © Cray Inc. June 20, 2011 18
  • 18.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 19
  • 19.
     Supports 2Nodes per ASIC  168 GB/sec routing capacity  Scales to over 100,000 network endpoints  Link Level Reliability and Adaptive Hyper Hyper Routing Transport Transport 3 3  Advanced Resiliency Features  Provides global address NIC 0 Netlink NIC 1 SB space Block Gemini LO  Advanced NIC designed to Processor efficiently support 48-Port  MPI YARC Router  One-sided MPI  Shmem  UPC, Coarray FORTRAN 2011 HPCMP User Group © Cray Inc. June 20, 2011 20
  • 20.
    Cray Baker Node Characteristics Number of 16 or 24 10 12X Gemini Cores Channels Peak 140 or 210 Gflops/s (Each Gemini High Radix YARC Router Performance acts like two nodes on the 3D with adaptive Memory Size 32 or 64 GB per Torus) Routing node 168 GB/sec capacity Memory 85 GB/sec Bandwidth 2011 HPCMP User Group © Cray Inc. June 20, 2011 21
  • 21.
    Module with SeaStar Z Y X Module with Gemini 2011 HPCMP User Group © Cray Inc. June 20, 2011 22
  • 22.
    net rsp net req LB ht treq p net LB Ring ht treq np FMA req T net ht trsp net net A req S req req vc0 ht p req net R S req O ht np req B I BTE R net D rsp B vc1 Router Tiles HT3 Cave NL ht irsp NPT vc1 ht np net ireq rsp net req CQ NAT ht np req H ht p req net rsp headers ht p A AMO net ht p req ireq R net req req net req vc0 B RMT ht p req RAT net rsp LM CLM  FMA (Fast Memory Access)  Mechanism for most MPI transfers  Supports tens of millions of MPI requests per second  BTE (Block Transfer Engine)  Supports asynchronous block transfers between local and remote memory, in either direction  For use for large MPI transfers that happen in the background 2011 HPCMP User Group © Cray Inc. June 20, 2011 23
  • 23.
     Two GeminiASICs are packaged on a pin-compatible mezzanine card  Topology is a 3-D torus  Each lane of the torus is composed of 4 Gemini router “tiles”  Systems with SeaStar interconnects can be upgraded by swapping this card  100% of the 48 router tiles on each Gemini chip are used 2011 HPCMP User Group © Cray Inc. June 20, 2011 24
  • 24.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 28
  • 25.
    Name Architecture Processor Network # Cores Memory/Core Jade XT-4 AMD Seastar 2.1 8584 2GB DDR2-800 Budapest (2.1 Ghz) Einstein XT-5 AMD Seastar 2.1 12827 2GB (some Shanghai (2.4 nodes have Ghz) 4GB/core) DDR2-800 MRAP XT-5 AMD Seastar 2.1 10400 4GB DDR2-800 Barcelona (2.3 Ghz) Garnet XE-6 Magny Cours Gemini 1.0 20160 2GB DDR3-1333 8 core 2.4 Ghz Raptor XE-6 Magny Cours Gemini 1.0 43712 2GB DDR3-1333 8 core 2.4 Ghz Chugach XE-6 Magny Cours Gemini 1.0 11648 2GB DDR3 -1333 8 core 2.3 Ghz 2011 HPCMP User Group © Cray Inc. June 20, 2011 29
  • 26.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 30
  • 27.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 31
  • 28.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 32
  • 29.
    Low Velocity Airflow High Velocity Airflow Low Velocity Airflow High Velocity Airflow 2011 HPCMP User Group © Cray Inc. June 20, 2011 33 Low Velocity Airflow
  • 30.
    Cool air isreleased into the computer room Liquid Liquid/Vapor in Mixture out Hot air stream passes through evaporator, rejects heat to R134a via liquid-vapor phase change (evaporation). R134a absorbs energy only in the presence of heated air. Phase change is 10x more efficient than pure water cooling. 2011 HPCMP User Group © Cray Inc. June 20, 2011 34
  • 31.
    R134a piping Exit Evaporators Inlet Evaporator 2011 HPCMP User Group © Cray Inc. June 20, 2011 35
  • 32.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 36
  • 33.
    Term Meaning Purpose MDS Metadata Server Manages all file metadata for filesystem. 1 per FS OST Object Storage Target The basic “chunk” of data written to disk. Max 160 per file. OSS Object Storage Server Communicates with disks, manages 1 or more OSTs. 1 or more per FS Stripe Size Size of chunks. Controls the size of file chunks stored to OSTs. Can’t be changed once file is written. Stripe Count Number of OSTs used per Controls parallelism of file. Can’t file. be changed once file is writte. 2011 HPCMP User Group © Cray Inc. June 20, 2011 37
  • 34.
    2011 HPCMP UserGroup © Cray Inc. une 20, 2011 J 38
  • 35.
    2011 HPCMP UserGroup © Cray Inc. une 20, 2011 J 39
  • 36.
     32 MBper OST (32 MB – 5 GB) and 32 MB Transfer Size  Unable to take advantage of file system parallelism  Access to multiple disks adds overhead which hurts performance Single Writer Write Performance 120 100 80 Write (MB/s) 1 MB Stripe 60 32 MB Stripe 40 Lustre 20 0 1 2 4 16 32 64 128 160 Stripe Count 40 2011 HPCMP User Group © Cray Inc. une 20, 2011 J
  • 37.
     Single OST,256 MB File Size  Performance can be limited by the process (transfer size) or file system (stripe size) Single Writer Transfer vs. Stripe Size 140 120 100 Write (MB/s) 80 32 MB Transfer 60 8 MB Transfer 1 MB Transfer 40 Lustre 20 0 1 2 4 8 16 32 64 128 Stripe Size (MB) 41 2011 HPCMP User Group © Cray Inc. une 20, 2011 J
  • 38.
     Use thelfs command, libLUT, or MPIIO hints to adjust your stripe count and possibly size  lfs setstripe -c -1 -s 4M <file or directory> (160 OSTs, 4MB stripe)  lfs setstripe -c 1 -s 16M <file or directory> (1 OST, 16M stripe)  export MPICH_MPIIO_HINTS=‘*: striping_factor=160’  Files inherit striping information from the parent directory, this cannot be changed once the file is written  Set the striping before copying in files 2011 HPCMP User Group © Cray Inc. June 20, 2011 42
  • 39.
    Available Compilers Cray Scientific Libraries Cray Message Passing Toolkit 2011 HPCMP User Group © Cray Inc. June 20, 2011 43
  • 40.
     Cray XT/XESupercomputers come with compiler wrappers to simplify building parallel applications (similar the mpicc/mpif90)  Fortran Compiler: ftn  C Compiler: cc  C++ Compiler: CC  Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries  Cray MPT (MPI, Shmem, etc.)  Cray LibSci (BLAS, LAPACK, etc.)  …  Choosing the underlying compiler is via the PrgEnv-* modules, do not call the PGI, Cray, etc. compilers directly.  Always load the appropriate xtpe-<arch> module for your machine  Enables proper compiler target  Links optimized math libraries 2011 HPCMP User Group © Cray Inc. June 20, 2011 44
  • 41.
    …from Cray’s Perspective PGI – Very good Fortran and C, pretty good C++  Good vectorization  Good functional correctness with optimization enabled  Good manual and automatic prefetch capabilities  Very interested in the Linux HPC market, although that is not their only focus  Excellent working relationship with Cray, good bug responsiveness  Pathscale – Good Fortran, C, possibly good C++  Outstanding scalar optimization for loops that do not vectorize  Fortran front end uses an older version of the CCE Fortran front end  OpenMP uses a non-pthreads approach  Scalar benefits will not get as much mileage with longer vectors  Intel – Good Fortran, excellent C and C++ (if you ignore vectorization)  Automatic vectorization capabilities are modest, compared to PGI and CCE  Use of inline assembly is encouraged  Focus is more on best speed for scalar, non-scaling apps  Tuned for Intel architectures, but actually works well for some applications on AMD 2011 HPCMP User Group © Cray Inc. June 20, 2011 45
  • 42.
    …from Cray’s Perspective GNU so-so Fortran, outstanding C and C++ (if you ignore vectorization)  Obviously, the best for gcc compatability  Scalar optimizer was recently rewritten and is very good  Vectorization capabilities focus mostly on inline assembly  Note the last three releases have been incompatible with each other (4.3, 4.4, and 4.5) and required recompilation of Fortran modules  CCE – Outstanding Fortran, very good C, and okay C++  Very good vectorization  Very good Fortran language support; only real choice for Coarrays  C support is quite good, with UPC support  Very good scalar optimization and automatic parallelization  Clean implementation of OpenMP 3.0, with tasks  Sole delivery focus is on Linux-based Cray hardware systems  Best bug turnaround time (if it isn’t, let us know!)  Cleanest integration with other Cray tools (performance tools, debuggers, upcoming productivity tools)  No inline assembly support 2011 HPCMP User Group © Cray Inc. June 20, 2011 46
  • 43.
     PGI  -fast –Mipa=fast(,safe)  If you can be flexible with precision, also try -Mfprelaxed  Compiler feedback: -Minfo=all -Mneginfo  man pgf90; man pgcc; man pgCC; or pgf90 -help  Cray  <none, turned on by default>  Compiler feedback: -rm (Fortran) -hlist=m (C)  If you know you don’t want OpenMP: -xomp or -Othread0  man crayftn; man craycc ; man crayCC  Pathscale  -Ofast Note: this is a little looser with precision than other compilers  Compiler feedback: -LNO:simd_verbose=ON  man eko (“Every Known Optimization”)  GNU  -O2 / -O3  Compiler feedback: good luck  man gfortran; man gcc; man g++  Intel  -fast  Compiler feedback:  man ifort; man icc; man iCC 2011 HPCMP User Group © Cray Inc. June 20, 2011 47
  • 44.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 48
  • 45.
     Traditional (scalar)optimizations are controlled via -O# compiler flags  Default: -O2  More aggressive optimizations (including vectorization) are enabled with the -fast or -fastsse metaflags  These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre –Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz –Mpre  Interprocedural analysis allows the compiler to perform whole-program optimizations. This is enabled with –Mipa=fast  See man pgf90, man pgcc, or man pgCC for more information about compiler options. 2011 HPCMP User Group © Cray Inc. June 20, 2011 49
  • 46.
     Compiler feedbackis enabled with -Minfo and -Mneginfo  This can provide valuable information about what optimizations were or were not done and why.  To debug an optimized code, the -gopt flag will insert debugging information without disabling optimizations  It’s possible to disable optimizations included with -fast if you believe one is causing problems  For example: -fast -Mnolre enables -fast and then disables loop redundant optimizations  To get more information about any compiler flag, add -help with the flag in question  pgf90 -help -fast will give more information about the -fast flag  OpenMP is enabled with the -mp flag 2011 HPCMP User Group © Cray Inc. June 20, 2011 50
  • 47.
    Some compiler optionsmay effect both performance and accuracy. Lower accuracy is often higher performance, but it’s also able to enforce accuracy.  -Kieee: All FP math strictly conforms to IEEE 754 (off by default)  -Ktrap: Turns on processor trapping of FP exceptions  -Mdaz: Treat all denormalized numbers as zero  -Mflushz: Set SSE to flush-to-zero (on with -fast)  -Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations  Some other compilers turn this on by default, PGI chooses to favor accuracy to speed by default. 2011 HPCMP User Group © Cray Inc. June 20, 2011 51
  • 48.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 52
  • 49.
     Cray hasa long tradition of high performance compilers on Cray platforms (Traditional vector, T3E, X1, X2)  Vectorization  Parallelization  Code transformation  More…  Investigated leveraging an open source compiler called LLVM  First release December 2008 2011 HPCMP User Group © Cray Inc. June 20, 2011 53
  • 50.
    Fortran Source C and C++ Source C and C++ Front End supplied by Edison Design Group, with Cray-developed Fortran Front End C & C++ Front End code for extensions and interface support Interprocedural Analysis Cray Inc. Compiler Technology Compiler Optimization and Parallelization X86 Code Cray X2 Code Generator Generator X86 Code Generation from Open Source LLVM, with Object File additional Cray-developed optimizations and interface support 2011 HPCMP User Group © Cray Inc. June 20, 2011 54
  • 51.
     Standard conforminglanguages and programming models  Fortran 2003  UPC & CoArray Fortran  Fully optimized and integrated into the compiler  No preprocessor involved  Target the network appropriately:  GASNet with Portals  DMAPP with Gemini & Aries  Ability and motivation to provide high-quality support for custom Cray network hardware  Cray technology focused on scientific applications  Takes advantage of Cray’s extensive knowledge of automatic vectorization  Takes advantage of Cray’s extensive knowledge of automatic shared memory parallelization  Supplements, rather than replaces, the available compiler choices 2011 HPCMP User Group © Cray Inc. June 20, 2011 55
  • 52.
     Make sureit is available  module avail PrgEnv-cray  To access the Cray compiler  module load PrgEnv-cray  To target the various chip  module load xtpe-[barcelona,shanghi,mc8]  Once you have loaded the module “cc” and “ftn” are the Cray compilers  Recommend just using default options  Use –rm (fortran) and –hlist=m (C) to find out what happened  man crayftn 2011 HPCMP User Group © Cray Inc. June 20, 2011 56
  • 53.
     Excellent Vectorization  Vectorize more loops than other compilers  OpenMP 3.0  Task and Nesting  PGAS: Functional UPC and CAF available today  C++ Support  Automatic Parallelization  Modernized version of Cray X1 streaming capability  Interacts with OMP directives  Cache optimizations  Automatic Blocking  Automatic Management of what stays in cache  Prefetching, Interchange, Fusion, and much more… 2011 HPCMP User Group © Cray Inc. June 20, 2011 57
  • 54.
     Loop BasedOptimizations  Vectorization  OpenMP  Autothreading  Interchange  Pattern Matching  Cache blocking/ non-temporal / prefetching  Fortran 2003 Standard; working on 2008  PGAS (UPC and Co-Array Fortran)  Some performance optimizations available in 7.1  Optimization Feedback: Loopmark  Focus 2011 HPCMP User Group © Cray Inc. June 20, 2011 58
  • 55.
     Cray compilersupports a full and growing set of directives and pragmas !dir$ concurrent !dir$ ivdep !dir$ interchange !dir$ unroll !dir$ loop_info [max_trips] [cache_na] ... Many more !dir$ blockable man directives man loop_info 2011 HPCMP User Group © Cray Inc. June 20, 2011 59
  • 56.
     Compiler cangenerate an filename.lst file.  Contains annotated listing of your source code with letter indicating important optimizations %%% L o o p m a r k L e g e n d %%% Primary Loop Type Modifiers ------- ---- ---- --------- a - vector atomic memory operation A - Pattern matched b - blocked C - Collapsed f - fused D - Deleted i - interchanged E - Cloned m - streamed but not partitioned I - Inlined p - conditional, partial and/or computed M - Multithreaded r - unrolled P - Parallel/Tasked s - shortloop V - Vectorized t - array syntax temp used W - Unwound w - unwound 2011 HPCMP User Group © Cray Inc. June 20, 2011 60
  • 57.
    • ftn –rm… or cc –hlist=m … 29. b-------< do i3=2,n3-1 30. b b-----< do i2=2,n2-1 31. b b Vr--< do i1=1,n1 32. b b Vr u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3) 33. b b Vr > + u(i1,i2,i3-1) + u(i1,i2,i3+1) 34. b b Vr u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1) 35. b b Vr > + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1) 36. b b Vr--> enddo 37. b b Vr--< do i1=2,n1-1 38. b b Vr r(i1,i2,i3) = v(i1,i2,i3) 39. b b Vr > - a(0) * u(i1,i2,i3) 40. b b Vr > - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) ) 41. b b Vr > - a(3) * ( u2(i1-1) + u2(i1+1) ) 42. b b Vr--> enddo 43. b b-----> enddo 44. b-------> enddo 2011 HPCMP User Group © Cray Inc. June 20, 2011 61
  • 58.
    ftn-6289 ftn: VECTORFile = resid.f, Line = 29 A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38. ftn-6049 ftn: SCALAR File = resid.f, Line = 29 A loop starting at line 29 was blocked with block size 4. ftn-6289 ftn: VECTOR File = resid.f, Line = 30 A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38. ftn-6049 ftn: SCALAR File = resid.f, Line = 30 A loop starting at line 30 was blocked with block size 4. ftn-6005 ftn: SCALAR File = resid.f, Line = 31 A loop starting at line 31 was unrolled 4 times. ftn-6204 ftn: VECTOR File = resid.f, Line = 31 A loop starting at line 31 was vectorized. ftn-6005 ftn: SCALAR File = resid.f, Line = 37 A loop starting at line 37 was unrolled 4 times. ftn-6204 ftn: VECTOR File = resid.f, Line = 37 A loop starting at line 37 was vectorized. 2011 HPCMP User Group © Cray Inc. June 20, 2011 62
  • 59.
     -hbyteswapio  Link time option  Applies to all unformatted fortran IO  Assign command  With the PrgEnv-cray module loaded do this: setenv FILENV assign.txt assign -N swap_endian g:su assign -N swap_endian g:du  Can use assign to be more precise 2011 HPCMP User Group © Cray Inc. June 20, 2011 63
  • 60.
     OpenMP isON by default  Optimizations controlled by –Othread#  To shut off use –Othread0 or –xomp or –hnoomp  Autothreading is NOT on by default;  -hautothread to turn on  Modernized version of Cray X1 streaming capability  Interacts with OMP directives If you do not want to use OpenMP and have OMP directives in the code, make sure to make a run with OpenMP shut off at compile time 2011 HPCMP User Group © Cray Inc. June 20, 2011 64
  • 61.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 65
  • 62.
     Cray havehistorically played a role in scientific library development  BLAS3 were largely designed for Crays  Standard libraries were tuned for Cray vector processors (later COTS)  Cray have always tuned standard libraries for Cray interconnect  In the 90s, Cray provided many non-standard libraries  Sparse direct, sparse iterative  These days the goal is to remain portable (standard APIs) whilst providing more performance  Advanced features, tuning knobs, environment variables 2011 HPCMP User Group © Cray Inc. June 20, 2011 66
  • 63.
    FFT Dense Sparse BLAS CRAFFT CASK LAPACK FFTW ScaLAPACK PETSc IRT P-CRAFFT Trilinos CASE IRT – Iterative Refinement Toolkit CASK – Cray Adaptive Sparse Kernels CRAFFT – Cray Adaptive FFT CASE – Cray Adaptive Simple Eigensolver 2011 HPCMP User Group © Cray Inc. June 20, 2011 69
  • 64.
     There aremany libsci libraries on the systems  One for each of  Compiler (intel, cray, gnu, pathscale, pgi )  Single thread, multiple thread  Target (istanbul, mc12 )  Best way to use libsci is to ignore all of this  Load the xtpe-module (some sites set this by default)  E.g. module load xtpe-shanghai / xtpe-istanbul / xtpe-mc8  Cray’s drivers will link the library automatically  PETSc, Trilinos, fftw, acml all have their own module  Tip : make sure you have the correct library loaded e.g. –Wl, -ydgemm_ 2011 HPCMP User Group © Cray Inc. June 20, 2011 70
  • 65.
     Perhaps youwant to link another library such as ACML  This can be done. If the library is provided by Cray, then load the module. The link will be performed with the libraries in the correct order.  If the library is not provided by Cray and has no module, add it to the link line.  Items you add to the explicit link will be in the correct place  Note, to get explicit BLAS from ACML but scalapack from libsci  Load acml module. Explicit calls to BLAS in code resolve from ACML  BLAS calls from the scalapack code will be resolved from libsci (no way around this) 2011 HPCMP User Group © Cray Inc. June 20, 2011 71
  • 66.
     Threading capabilitiesin previous libsci versions were poor  Used PTHREADS (more explicit affinity etc)  Required explicit linking to a _mp version of libsci  Was a source of concern for some applications that need hybrid performance and interoperability with openMP  LibSci 10.4.2 February 2010  OpenMP-aware LibSci  Allows calling of BLAS inside or outside parallel region  Single library supported (there is still a single thread lib)  Usage – load the xtpe module for your system (mc12) GOTO_NUM_THREADS outmoded – use OMP_NUM_THREADS 2011 HPCMP User Group © Cray Inc. June 20, 2011 72
  • 67.
     Allows seamlesscalling of the BLAS within or without a parallel region e.g. OMP_NUM_THREADS = 12 call dgemm(…) threaded dgemm is used with 12 threads !$OMP PARALLEL DO do call dgemm(…) single thread dgemm is used end do Some users are requesting a further layer of parallelism here (see later) 2011 HPCMP User Group © Cray Inc. June 20, 2011 73
  • 68.
    120 Libsci DGEMM efficiency 100 80 GFLOPs 1thread 60 3threads 6threads 40 9threads 12threads 20 0 Dimension (square) Inc. 2011 HPCMP User Group © Cray June 20, 2011 74
  • 69.
    140 Libsci-10.5.2 performance on 2 x MC12 2.0 GHz K=64 120 (Cray XE6) K=128 100 K=200 K=228 80 GFLOPS K=256 60 K=300 K=400 40 K=500 20 K=600 0 K=700 1 2 4 8 12 16 20 24 K=800 Number of threads 2011 HPCMP User Group © Cray Inc. June 20, 2011 75
  • 70.
     All BLASlibraries are optimized for rank-k update * =  However, a huge % of dgemm usage is not from solvers but explicit calls  E.g. DCA++ matrices are of this form * =  How can we very easily provide an optimization for these types of matrices? 2011 HPCMP User Group © Cray Inc. June 20, 2011 76
  • 71.
     Cray BLASexisted on every Cray machine between Cray-2 and Cray X2  Cray XT line did not include Cray BLAS  Cray’s expertise was in vector processors  GotoBLAS was the best performing x86 BLAS  LibGoto is now discontinued  In Q3 2011 LibSci will be released with Cray BLAS 2011 HPCMP User Group © Cray Inc. June 20, 2011 77
  • 72.
    1. Customers require more OpenMP features unobtainable with current library 2. Customers require more adaptive performance for unusual problems .e.g. DCA++ 3. Interlagos / Bulldozer is a dramatic shift in ISA/architecture/performance 4. Our auto-tuning framework has advanced to the point that we can tackle this problem (good BLAS is easy, excellent BLAS is very hard) 5. Need for Bit-reproducable BLAS at high-performance 2011 HPCMP User Group © Cray Inc. June 20, 2011 78
  • 73.
    "anything that canbe represented in C, Fortran or ASM code can be generated automatically by one instance of an abstract operator in high-level code“ In other words, if we can create a purely general model of matrix-multiplication, and create every instance of it, then at least one of the generated schemes will perform well 2011 HPCMP User Group © Cray Inc. June 20, 2011 79
  • 74.
     Start witha completely general formulation of the BLAS  Use a DSL that expresses every important optimization  Auto-generate every combination of orderings, buffering, and optimization  For every combination of the above, sweep all possible sizes  For a given input set ( M, N, K, datatype, alpha, beta ) map the best dgemm routine to the input  The current library should be a specific instance of the above  Worst-case performance can be no worse than current library  The lowest level of blocking is a hand-written assembly kernel 2011 HPCMP User Group © Cray Inc. June 20, 2011 80
  • 75.
    7.5 7.45 7.4 7.35 7.3 bframe GFLOPS 7.25 libsci 7.2 7.15 7.1 7.05 143 72 12 17 22 27 62 133 67 37 42 57 105 2 7 47 100 128 138 95 32 52 2011 HPCMP User Group © Cray Inc. June 20, 2011 81
  • 76.
     New optimizationsfor Gemini network in the ScaLAPACK LU and Cholesky routines 1. Change the default broadcast topology to match the Gemini network 2. Give tools to allow the topology to be changed by the user 3. Give guidance on how grid-shape can affect the performance 2011 HPCMP User Group © Cray Inc. June 20, 2011 82
  • 77.
     Parallel Versionof LAPACK GETRF  Panel Factorization  Only single column block is involved  The rest of PEs are waiting  Trailing matrix update  Major part of the computation  Column-wise broadcast (Blocking)  Row-wise broadcast (Asynchronous)  Data is packed before sending using PBLAS  Broadcast uses BLACS library  These broadcasts are the major communication patterns 2011 HPCMP User Group © Cray Inc. June 20, 2011 83
  • 78.
     MPI default  Binomial Tree + node-aware broadcast  All PEs makes implicit barrier to make sure the completion  Not suitable for rank-k update  Bidirectional-Ring broadcast  Root PE makes 2 MPI Send calls to both of the directions  The immediate neighbor finishes first  ScaLAPACK’s default  Better than MPI 2011 HPCMP User Group © Cray Inc. June 20, 2011 84
  • 79.
     Increasing RingBroadcast (our new default)  Root makes a single MPI call to the immediate neighbor  Pipelining  Better than bidirectional ring  The immediate neighbor finishes first  Multi-Ring Broadcast (2, 4, 8 etc)  The immediate neighbor finishes first  The root PE sends to multiple sub-rings  Can be done with tree algorithm  2 rings seems the best for row-wise broadcast of LU 2011 HPCMP User Group © Cray Inc. June 20, 2011 85
  • 80.
     Hypercube  Behaves like MPI default  Too many collisions in the message traffic  Decreasing Ring  The immediate neighbor finishes last  No benefit in LU  Modified Increasing Ring  Best performance in HPL  As good as increasing ring 2011 HPCMP User Group © Cray Inc. June 20, 2011 86
  • 81.
    XDLU performance: 3072cores, size=65536 10000 9000 8000 7000 6000 Gflops 5000 4000 3000 SRING IRING 2000 1000 0 32 64 32 64 32 64 32 64 32 64 48 48 24 24 12 12 32 32 16 16 64 64 128 128 256 256 96 96 192 192 NB / P / Q 2011 HPCMP User Group © Cray Inc. June 20, 2011 87
  • 82.
    XDLU performance: 6144cores, size=65536 14000 12000 10000 8000 Gflops 6000 SRING 4000 IRING 2000 0 32 64 32 64 32 64 32 64 32 64 48 48 24 24 12 12 64 64 32 32 128 128 256 256 512 512 96 96 192 192 NB / P / Q 2011 HPCMP User Group © Cray Inc. June 20, 2011 88
  • 83.
     Row MajorProcess Grid puts adjacent PEs in the same row  Adjacent PEs are most probably located in the same node  In flat MPI, 16 or 24 PEs are in the same node  In hybrid mode, several are in the same node  Most MPI sends in I-ring happen in the same node  MPI has good shared-memory device  Good pipelining Node 0 Node 1 Node 2 2011 HPCMP User Group © Cray Inc. June 20, 2011 89
  • 84.
     For PxGETRF:  The variables let users to choose  SCALAPACK_LU_CBCAST broadcast algorithm :  SCALAPACK_LU_RBCAST  IRING increasing ring  For PxPOTRF: (default value)  DRING decreasing ring  SCALAPACK_LLT_CBCAST  SRING split ring (old default  SCALAPCK_LLT_RBCAST value)  SCALAPACK_UTU_CBCAST  MRING multi-ring SCALAPACK_UTU_RBCAST  HYPR hypercube  MPI mpi_bcast  TREE tree  There is also a set function, allowing  FULL full connected the user to change these on the fly 2011 HPCMP User Group © Cray Inc. June 20, 2011 91
  • 85.
     Grid shape/ size  Square grid is most common  Try to use Q = x * P grids, where x = 2, 4, 6, 8  Square grids not often the best  Blocksize  Unlike HPL, fine-tuning not important.  64 usually the best  Ordering  Try using column-major ordering, it can be better  BCAST  The new default will be a huge improvement if you can make your grid the right way. If you cannot, play with the environment variables. 2011 HPCMP User Group © Cray Inc. June 20, 2011 92
  • 86.
  • 87.
Full MPI-2 support (except process spawning), based on ANL MPICH2
   Cray uses the MPICH2 Nemesis layer for Gemini
   Cray-tuned collectives
   Cray-tuned ROMIO for MPI-IO
Current release: 5.3.0 (MPICH2 1.3.1)
   Improved MPI_Allreduce and MPI_Alltoallv
   Initial support for checkpoint/restart for MPI or Cray SHMEM on XE systems
   Improved support for MPI thread safety
   module load xt-mpich2
Tuned SHMEM library
   module load xt-shmem
  • 88.
[Chart] MPI_Alltoall with 10,000 processes, comparing the original and optimized algorithms on Cray XE6 systems: time in microseconds versus message size (256 to 32768 bytes).
  • 89.
[Chart] 8-byte MPI_Allgather and MPI_Allgatherv scaling, comparing the original and optimized algorithms on Cray XE6 systems (both algorithms were optimized for Cray XE6): time in microseconds versus number of processes (1024p to 32768p).
  • 90.
Eager-protocol message size limit
   Default is 8192 bytes
   Maximum size message that can go through the eager protocol
   May help for apps that send medium-size messages and do better when loosely coupled.
      Does the application spend a large amount of time in MPI_Waitall? Setting this
      environment variable higher may help.
   Maximum value is 131072 bytes
   Remember that for this path it helps to pre-post receives if possible (see the sketch below)
   Note that a 40-byte CH3 header is included when accounting for the message size
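For readers unsure what "pre-post receives" means in practice, here is a minimal, hypothetical sketch (buffer layout and neighbor ranks are invented, not from the slides): the receives are posted before the matching sends arrive, so eager messages can land directly in the user buffers instead of an unexpected-message buffer.

    /* Sketch: pre-posting receives ahead of the matching sends. */
    #include <mpi.h>

    void exchange(double *recvbuf, double *sendbuf, int n,
                  int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];

        /* post the receives first ... */
        MPI_Irecv(recvbuf,     n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(recvbuf + n, n, MPI_DOUBLE, right, 0, comm, &req[1]);

        /* ... then send; small messages travel via the eager path */
        MPI_Send(sendbuf,     n, MPI_DOUBLE, right, 0, comm);
        MPI_Send(sendbuf + n, n, MPI_DOUBLE, left,  0, comm);

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }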
  • 91.
Eager-protocol DMA buffers
   Default is 64 32-KB buffers (2 MB total)
   Controls the number of 32-KB DMA buffers available to each rank for the eager protocol described earlier
   May help to increase modestly, but other resources constrain the usability of a large number of buffers
  • 92.
  • 93.
What do I mean by PGAS?
   Partitioned Global Address Space
   UPC
   Coarray Fortran (Fortran 2008)
   SHMEM (counted as PGAS here for convenience)
SHMEM is library based
   Not part of any language standard
   Compiler independent
   The compiler has no knowledge that it is compiling a PGAS code and does nothing different, i.e. no transformations or optimizations
  • 94.
UPC
   Specification that extends the ISO/IEC 9899 C standard
   Integrated into the language
   Heavily compiler dependent
   Compiler intimately involved in detecting and executing remote references
   Flexible, but filled with challenges: pointers, a lack of true multidimensional arrays, and many options for distributing data
Fortran 2008
   Now incorporates coarrays
   Compiler dependent
   Philosophically different from UPC
   Replication of arrays on every image, with "easy and obvious" ways to access those remote locations
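For readers new to UPC syntax, here is a tiny, self-contained example (not from the slides; the array name and sizes are invented) showing a blocked shared array and an implicit remote read, which is the kind of plain assignment the compiler must turn into a network get:

    /* Minimal UPC illustration: a shared array blocked across threads;
     * each thread reads one element owned by its neighbor. */
    #include <upc.h>
    #include <stdio.h>

    #define N_PER_THREAD 4
    shared [N_PER_THREAD] long data[N_PER_THREAD * THREADS];

    int main(void)
    {
        /* initialize the locally owned block */
        for (int i = 0; i < N_PER_THREAD; i++)
            data[MYTHREAD * N_PER_THREAD + i] = MYTHREAD;

        upc_barrier;

        /* read an element owned by the next thread; the compiler turns
         * this ordinary-looking assignment into a remote get */
        long remote = data[((MYTHREAD + 1) % THREADS) * N_PER_THREAD];
        printf("thread %d saw %ld\n", MYTHREAD, remote);
        return 0;
    }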
  • 95.
  • 96.
Translate the UPC source code into hardware-executable operations that produce the proper behavior, as defined by the specification
   Storing to a remote location?
   Loading from a remote location?
   When does the transfer need to be complete?
   Are there any dependencies between this transfer and anything else?
   No ordering guarantees are provided by the network; the compiler is responsible for making sure everything gets to its destination in the correct order
  • 97.
What the user writes:

    for ( i = 0; i < ELEMS_PER_THREAD; i += 1 ) {
        local_data[i] += global_2d[i][target];
    }

What the compiler conceptually generates:

    for ( i = 0; i < ELEMS_PER_THREAD; i += 1 ) {
        temp = pgas_get(&global_2d[i][target]); // initiate the get
        pgas_fence();                           // make sure the get is complete
        local_data[i] += temp;                  // use the local copy to complete the operation
    }

The compiler must
   Recognize you are referencing a shared location
   Initiate the load of the remote data
   Make sure the transfer has completed
   Proceed with the calculation
   Repeat for all iterations of the loop
  • 98.
    for ( i = 0; i < ELEMS_PER_THREAD; i += 1 ) {
        temp = pgas_get(&global_2d[i][target]); // initiate the get
        pgas_fence();                           // make sure the get is complete
        local_data[i] += temp;                  // use the local copy to complete the operation
    }

Simple translation results in
   Single-word references
   Lots of fences
   Little to no latency hiding
   No use of special hardware
   Nothing here says "fast"
  • 99.
We want the compiler to generate code that runs as fast as possible given what the user has written, or to allow the user to get fast performance with simple modifications.
   Increase message size
      Do multi-/many-word transfers whenever possible, not single-word
   Minimize fences
      Delay the fence "as much as possible"
      Eliminate the fence in some circumstances
   Use the appropriate hardware
      Use on-node hardware for on-node transfers
      Use the transfer mechanism appropriate for the message size
   Overlap communication and computation
   Use hardware atomic functions where appropriate
  • 100.
Primary Loop Types          Modifiers
A - Pattern matched         a - atomic memory operation
C - Collapsed               b - blocked
D - Deleted                 c - conditional and/or computed
E - Cloned                  f - fused
G - Accelerated             g - partitioned
I - Inlined                 i - interchanged
M - Multithreaded           m - partitioned
V - Vectorized              n - non-blocking remote transfer
                            p - partial
                            r - unrolled
                            s - shortloop
                            w - unwound
  • 101.
  • 102.
     15.          shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
     ...
     83.  1            before = upc_ticks_now();
     84.  1  r8------< for ( i = 0, j = target; i < ELEMS_PER_THREAD ;
     85.  1  r8            i += 1, j += THREADS ) {
     86.  1  r8  n         local_data[i] = global_1d[j];
     87.  1  r8------> }
     88.  1            after = upc_ticks_now();

1D get BW = 0.027598 Gbytes/s
  • 103.
     15.          shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
     ...
    101.  1            before = upc_ticks_now();
    102.  1            upc_memget(&local_data[0], &global_1d[target], 8*ELEMS_PER_THREAD);
    103.  1
    104.  1            after = upc_ticks_now();

1D get BW        = 0.027598 Gbytes/s
1D upc_memget BW = 4.972960 Gbytes/s
upc_memget is 184 times faster!
  • 104.
     16.          shared long global_2d[MAX_ELEMS_PER_THREAD][THREADS];
     ...
    121.  1  A-------< for ( i = 0; i < ELEMS_PER_THREAD; i+=1) {
    122.  1  A             local_data[i] = global_2d[i][target];
    123.  1  A-------> }

1D get BW        = 0.027598 Gbytes/s
1D upc_memget BW = 4.972960 Gbytes/s
2D get BW        = 4.905653 Gbytes/s
Pattern matching can give you the same performance as using upc_memget.
  • 105.
  • 106.
PGAS data references made by the single statement immediately following the pgas defer_sync directive will not be synchronized until the next fence instruction.
   Only applies to the next UPC/CAF statement
   Does not apply to UPC "routines"
   Does not apply to SHMEM routines
Normally the compiler synchronizes the references in a statement as late as possible without violating program semantics. The purpose of the defer_sync directive is to synchronize the references even later, beyond where the compiler can determine it is safe.
Extremely powerful!
   Can easily overlap communication and computation with this directive (see the sketch below)
   Applies to both "gets" and "puts"
   Can be used to implement a variety of "tricks". Use your imagination!
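A hedged sketch of the overlap idea in UPC. The directive spelling and placement follow CCE's PGAS support as I understand it, and the names are invented, so treat the details as an assumption and check the compiler documentation:

    /* Sketch: issue a remote get early, do local work, and only force
     * completion at a later fence. */
    #include <upc.h>

    shared double remote_val[THREADS];
    double local_work(void);

    double overlap_example(void)
    {
        double tmp, acc;

    #pragma pgas defer_sync
        tmp = remote_val[(MYTHREAD + 1) % THREADS];  /* get starts, sync deferred */

        acc = local_work();      /* unrelated computation overlaps the transfer */

        upc_fence;               /* the deferred get must be complete here */
        return acc + tmp;
    }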
  • 107.
CrayPAT
  • 108.
Future system basic characteristics:
   Many-core, hybrid multi-core computing
   Increase in on-node concurrency
   10s-100s of cores sharing memory
   With or without a companion accelerator
   Vector hardware at the low level
Impact on applications:
   Restructure / evolve applications while using existing programming models to take advantage of increased concurrency
   Expand use of mixed-mode programming models (MPI + OpenMP + accelerated kernels, etc.)
  • 109.
Focus on automation (simplify tool usage, provide feedback based on analysis)
Enhance support for multiple programming models within a program (MPI, PGAS, OpenMP, SHMEM)
Scaling (larger jobs, more data, better tool response)
New processors and interconnects
Extend the performance tools to include pre-runtime optimization information from the Cray compiler
  • 110.
New predefined wrappers (ADIOS, ARMCI, PETSc, PGAS libraries)
More UPC and Co-array Fortran support
Support for non-record-locking file systems
Support for applications built with shared libraries
Support for Chapel programs
pat_report tables available in Cray Apprentice2
  • 111.
Enhanced PGAS support is available in perftools 5.1.3 and later
Profiles of a PGAS program can be created to show:
   Top time-consuming functions/line numbers in the code
   Load imbalance information
Performance statistics are attributed to user source by default
   Statistics can also be exposed by library, to see underlying operations such as wait time on barriers
Data collection is based on the methods used for the MPI library
   PGAS data is collected by default when using Automatic Profiling Analysis (pat_build -O apa)
   Predefined wrappers for runtime libraries (caf, upc, pgas) enable attribution of samples or time to user source
UPC and SHMEM heap tracking is coming in a subsequent release
   -g heap will track the shared heap in addition to the local heap
  • 112.
Table 1:  Profile by Function

 Samp % |  Samp |  Imb. |   Imb. |Group
        |       |  Samp | Samp % | Function
        |       |       |        |  PE='HIDE'

 100.0% |    48 |    -- |     -- |Total
|------------------------------------------
|  95.8% |    46 |    -- |     -- |USER
||-----------------------------------------
||  83.3% |    40 |  1.00 |   3.3% |all2all
||   6.2% |     3 |  0.50 |  22.2% |do_cksum
||   2.1% |     1 |  1.00 |  66.7% |do_all2all
||   2.1% |     1 |  0.50 |  66.7% |mpp_accum_long
||   2.1% |     1 |  0.50 |  66.7% |mpp_alloc
||=========================================
|   4.2% |     2 |    -- |     -- |ETC
||-----------------------------------------
||   4.2% |     2 |  0.50 |  33.3% |bzero
|==========================================
  • 113.
Table 2:  Profile by Group, Function, and Line

 Samp % |  Samp |  Imb. |   Imb. |Group
        |       |  Samp | Samp % | Function
        |       |       |        |  Source
        |       |       |        |   Line
        |       |       |        |    PE='HIDE'

 100.0% |    48 |    -- |     -- |Total
|--------------------------------------------
|  95.8% |    46 |    -- |     -- |USER
||-------------------------------------------
||  83.3% |    40 |    -- |     -- |all2all
3|        |       |       |        | mpp_bench.c
4|        |       |       |        |  line.298
||   6.2% |     3 |    -- |     -- |do_cksum
3|        |       |       |        | mpp_bench.c
||||-----------------------------------------
4|||  2.1% |     1 |  0.25 |  33.3% |line.315
4|||  4.2% |     2 |  0.25 |  16.7% |line.316
||||=========================================
  • 114.
Table 1:  Profile by Function and Callers, with Line Numbers

 Samp % |  Samp |Group
        |       | Function
        |       |  Caller
        |       |   PE='HIDE'

 100.0% |    47 |Total
|---------------------------
|  93.6% |    44 |ETC
||--------------------------
||  85.1% |    40 |upc_memput
3|        |       | all2all:mpp_bench.c:line.298
4|        |       |  do_all2all:mpp_bench.c:line.348
5|        |       |   main:test_all2all.c:line.70
||   4.3% |     2 |bzero
3|        |       | (N/A):(N/A):line.0
||   2.1% |     1 |upc_all_alloc
3|        |       | mpp_alloc:mpp_bench.c:line.143
4|        |       |  main:test_all2all.c:line.25
||   2.1% |     1 |upc_all_reduceUL
3|        |       | mpp_accum_long:mpp_bench.c:line.185
4|        |       |  do_cksum:mpp_bench.c:line.317
5|        |       |   do_all2all:mpp_bench.c:line.341
6|        |       |    main:test_all2all.c:line.70
||==========================
  • 115.
Table 1:  Profile by Function and Callers, with Line Numbers

 Time % |     Time |   Calls |Group
        |          |         | Function
        |          |         |  Caller
        |          |         |   PE='HIDE'

 100.0% | 0.795844 | 73904.0 |Total
|-----------------------------------------
|  78.9% | 0.628058 | 41121.8 |PGAS
||----------------------------------------
||  76.1% | 0.605945 | 32768.0 |__pgas_put
3|        |          |         | all2all:mpp_bench.c:line.298
4|        |          |         |  do_all2all:mpp_bench.c:line.348
5|        |          |         |   main:test_all2all.c:line.70
||   1.5% | 0.012113 |    10.0 |__pgas_barrier
3|        |          |         | (N/A):(N/A):line.0
...
  • 116.
||========================================
|  15.7% | 0.125006 |     3.0 |USER
||----------------------------------------
||  12.2% | 0.097125 |     1.0 |do_all2all
3|        |          |         | main:test_all2all.c:line.70
||   3.5% | 0.027668 |     1.0 |main
3|        |          |         | (N/A):(N/A):line.0
||========================================
|   5.4% | 0.042777 | 32777.2 |UPC
||----------------------------------------
||   5.3% | 0.042321 | 32768.0 |upc_memput
3|        |          |         | all2all:mpp_bench.c:line.298
4|        |          |         |  do_all2all:mpp_bench.c:line.348
5|        |          |         |   main:test_all2all.c:line.70
|=========================================
  • 117.
[Screenshot: Cray Apprentice2 with the new text-table icon; right click for table generation options.]
  • 118.
  • 119.
Scalability
   New .ap2 data format and client/server model
      Reduced pat_report processing and report-generation times
      Reduced app2 data load times
      Graphical presentation handled locally (not passed through the ssh connection)
      Better tool responsiveness
   Minimizes data loaded into memory at any given time
   Reduced server footprint on the Cray XT/XE service node
   Larger jobs supported
Distributed Cray Apprentice2 (app2) client for Linux
   app2 clients for Mac and Windows laptops coming later this year
  • 120.
CPMD: MPI, instrumented with pat_build -u, HWPC=1, 960 cores

                    Perftools 5.1.3     Perftools 5.2.0
   .xf -> .ap2      88.5 seconds        22.9 seconds
   ap2 -> report    1512.27 seconds     49.6 seconds

VASP: MPI, instrumented with pat_build -gmpi -u, HWPC=3, 768 cores

                    Perftools 5.1.3     Perftools 5.2.0
   .xf -> .ap2      45.2 seconds        15.9 seconds
   ap2 -> report    796.9 seconds       28.0 seconds
  • 121.
From a Linux desktop:
   % module load perftools
   % app2
   % app2 kaibab:
   % app2 kaibab:/lus/scratch/heidi/swim+pat+10302-0t.ap2
   File -> Open Remote...
(A ':' signifies a remote host instead of an ap2 file.)
  • 122.
Optional app2 client for Linux desktops available as of 5.2.0
   Can still run app2 from a Cray service node
   Improves response times, as X11 traffic is no longer passed through the ssh connection
   Replaces the 32-bit Linux desktop version of Cray Apprentice2
   Uses libssh to establish the connection
   app2 clients for Windows and Mac coming in a subsequent release
  • 123.
[Diagram: Linux desktop (X Window System) connected via the X11 protocol to app2 running on a Cray XT login node; the compute nodes produce the collected performance data in my_program.ap2.]

Log into a Cray XT/XE login node
   % ssh -Y seal
Launch Cray Apprentice2 on the Cray XT/XE login node
   % app2 /lus/scratch/mydir/my_program.ap2
The user interface is displayed on the desktop via ssh trusted X11 forwarding
The entire my_program.ap2 file is loaded into memory on the XT login node (can be gigabytes of data)
  • 124.
[Diagram: Linux desktop running the app2 client (X Window System), requesting data from an app2 server on the Cray XT login node; only the requested subset of my_program.ap2 is sent to the client.]

Launch Cray Apprentice2 on the desktop and point it at the data
   % app2 seal:/lus/scratch/mydir/my_program.ap2
The user interface is displayed on the desktop via X Windows-based software
A minimal subset of the data from my_program.ap2 is loaded into memory on the Cray XT/XE service node at any given time
Only the data requested is sent from the server to the client
  • 125.
  • 126.
Major change to the way HW counters are collected, starting with CrayPat 5.2.1 and CLE 4.0 (in conjunction with Interlagos support)
   Linux has officially incorporated support for accessing counters through the perf_events subsystem. Until now, Linux kernels had to be patched to add support for perfmon2, which provided access to the counters for PAPI and for CrayPat.
   Seamless to users, except:
      The overhead incurred when accessing counters has increased
      This creates additional application perturbation
      Cray is working to bring this back in line with perfmon2 overhead
  • 127.
When possible, CrayPat will identify the dominant communication grids (communication patterns) in a program
   Example: nearest-neighbor exchange in 2 or 3 dimensions
   Sweep3D uses a 2-D grid for communication
Determine whether or not a custom MPI rank order will produce a significant performance benefit
   Custom rank orders are helpful for programs with significant point-to-point communication
   This doesn't interfere with MPI collective communication optimizations
  • 128.
Focuses on intra-node communication (place ranks that communicate frequently on the same node, or close by)
   Option to focus on other metrics, such as memory bandwidth
Determine the rank order used during the run that produced the data
Determine the grid that defines the communication
Produce a custom rank order if it is beneficial, based on grid size, grid order, and cost metric
Summarize the findings in the report
   Describe how to re-run with the custom rank order
  • 129.
For Sweep3D with 768 MPI ranks:

   This application uses point-to-point MPI communication between nearest
   neighbors in a 32 X 24 grid pattern. Time spent in this communication
   accounted for over 50% of the execution time. A significant fraction
   (but not more than 60%) of this time could potentially be saved by
   using the rank order in the file MPICH_RANK_ORDER.g, which was
   generated along with this report.

   To re-run with a custom rank order ...
  • 130.
Assist the user with application performance analysis and optimization
   Help the user identify important and meaningful information from potentially massive data sets
   Help the user identify problem areas instead of just reporting data
   Bring optimization knowledge to a wider set of users
Focus on ease of use and intuitive user interfaces
   Automatic program instrumentation
   Automatic analysis
Target scalability issues in all areas of tool development
   Data management: storage, movement, presentation
  • 131.
Supports traditional post-mortem performance analysis
   Automatic identification of performance problems
   Indication of the causes of problems
   Suggestions of modifications for performance improvement
CrayPat
   pat_build: automatic instrumentation (no source code changes needed)
   Run-time library for measurements (transparent to the user)
   pat_report for performance analysis reports
   pat_help: online help utility
Cray Apprentice2
   Graphical performance analysis and visualization tool
  • 132.
CrayPat
   Instrumentation of optimized code
   No source code modification required
   Data collection transparent to the user
   Text-based performance reports
   Derived metrics
   Performance analysis
Cray Apprentice2
   Performance data visualization tool
   Call tree view
   Source code mappings
  • 133.
When performance measurement is triggered
   External agent (asynchronous)
      Sampling
         Timer interrupt
         Hardware counter overflow
   Internal agent (synchronous)
      Code instrumentation
         Event based
         Automatic or manual instrumentation
How performance data is recorded
   Profile ::= summation of events over time
      Run-time summarization (functions, call sites, loops, ...)
   Trace file ::= sequence of events over time
  • 134.
Millions of lines of code
   Automatic profiling analysis
      Identifies the top time-consuming routines
      Automatically creates an instrumentation template customized to your application
Lots of processes/threads
   Load imbalance analysis
      Identifies computational code regions and synchronization calls that could benefit most from load balance optimization
      Estimates the savings if the corresponding section of code were balanced
Long-running applications
   Detection of outliers
  • 135.
Important performance statistics:
   Top time-consuming routines
   Load balance across computing resources
   Communication overhead
   Cache utilization
   FLOPS
   Vectorization (SSE instructions)
   Ratio of computation versus communication
  • 136.
No source code or makefile modification required
   Automatic instrumentation at the group (function) level
      Groups: mpi, io, heap, math SW, ...
Performs link-time instrumentation
   Requires object files
   Instruments optimized code
   Generates a stand-alone instrumented program
   Preserves the original binary
Supports sample-based and event-based instrumentation
  • 137.
Analyzes the performance data and directs the user to meaningful information
Simplifies the procedure to instrument and collect performance data for novice users
Based on a two-phase mechanism:
   1. Automatically detects the most time-consuming functions in the application and feeds this information back to the tool for further (and focused) data collection
   2. Provides performance information on the most significant parts of the application
  • 138.
Performs data conversion
   Combines information from the binary with the raw performance data
Performs analysis on the data
Generates a text report of the performance results
Formats data for input into Cray Apprentice2
  • 139.
CrayPat / Cray Apprentice2 5.0 released September 10, 2009
   New internal data format
   FAQ
   Grid placement support
   Better caller information (ETC group in pat_report)
   Support for larger numbers of processors
   Client/server version of Cray Apprentice2
   Panel help in Cray Apprentice2
  • 140.
Access the performance tools software
   % module load perftools
Build the application, keeping .o files (CCE: -h keepfiles)
   % make clean
   % make
Instrument the application for automatic profiling analysis
   % pat_build -O apa a.out
   You should get an instrumented program a.out+pat
Run the application to get the top time-consuming routines
   % aprun ... a.out+pat     (or qsub <pat script>)
   You should get a performance file ("<sdatafile>.xf") or multiple files in a directory <sdatadir>
  • 141.
Generate a report and a .apa instrumentation file
   % pat_report -o my_sampling_report [<sdatafile>.xf | <sdatadir>]
Inspect the .apa file and the sampling report
Verify whether additional instrumentation is needed
  • 142.
#  You can edit this file, if desired, and use it
#  to reinstrument the program for tracing like this:
#
#      pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
#  These suggested trace options are based on data from:
#
#      /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2,
#      /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf
# ----------------------------------------------------------------------
#  HWPC group to collect by default.

  -Drtenv=PAT_RT_HWPC=1   # Summary with instructions metrics.
# ----------------------------------------------------------------------
#  Libraries to trace.

  -g mpi
# ----------------------------------------------------------------------
#  User-defined functions to trace, sorted by % of samples.
#  Limited to top 200. A function is commented out if it has < 1%
#  of samples, or if a cumulative threshold of 90% has been reached,
#  or if it has size < 200 bytes.
#  Note: -u should NOT be specified as an additional option.

# 43.37%  99659 bytes
  -T mlwxyz_

# 16.09%  17615 bytes
  -T half_

#  6.82%   6846 bytes
  -T artv_

#  1.29%   5352 bytes
  -T currenh_

#  1.03%  25294 bytes
  -T bndbo_

#  Functions below this point account for less than 10% of samples.

#  1.03%  31240 bytes
# -T bndto_

  ...
# ----------------------------------------------------------------------

  -o mhd3d.x+apa                      # New instrumented program.

  /work/crayadm/ldr/mhd3d/mhd3d.x     # Original program.
  • 143.
biolib     Cray Bioinformatics library routines
blacs      Basic Linear Algebra Communication Subprograms
blas       Basic Linear Algebra Subprograms
caf        Co-Array Fortran (Cray X2 systems only)
fftw       Fast Fourier Transform library (64-bit only)
hdf5       manages extremely large and complex data collections
heap       dynamic heap
io         includes stdio and sysio groups
lapack     Linear Algebra Package
lustre     Lustre file system
math       ANSI math
mpi        MPI
netcdf     network common data form (manages array-oriented scientific data)
omp        OpenMP API (not supported on Catamount)
omp-rtl    OpenMP runtime library (not supported on Catamount)
portals    lightweight message passing API
pthreads   POSIX threads (not supported on Catamount)
scalapack  Scalable LAPACK
shmem      SHMEM
stdio      all library functions that accept or return the FILE* construct
sysio      I/O system calls
system     system calls
upc        Unified Parallel C (Cray X2 systems only)
  • 144.
 0  Summary with instruction metrics
 1  Summary with TLB metrics
 2  L1 and L2 metrics
 3  Bandwidth information
 4  HyperTransport information
 5  Floating point mix
 6  Cycles stalled, resources idle
 7  Cycles stalled, resources full
 8  Instructions and branches
 9  Instruction cache
10  Cache hierarchy
11  Floating point operations mix (2)
12  Floating point operations mix (vectorization)
13  Floating point operations mix (SP)
14  Floating point operations mix (DP)
15  L3 (socket-level)
16  L3 (core-level reads)
17  L3 (core-level misses)
18  L3 (core-level fills caused by L2 evictions)
19  Prefetches
  • 145.
Regions, useful to break up long routines
   int PAT_region_begin(int id, const char *label)
   int PAT_region_end(int id)
Disable/enable profiling, useful for excluding initialization
   int PAT_record(int state)
Flush buffer, useful when the program isn't exiting cleanly
   int PAT_flush_buffer(void)
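A short sketch of how these calls might be used in C. The header name pat_api.h and the PAT_STATE_ON/PAT_STATE_OFF constants are assumptions on my part (the slide only lists the function signatures), so check pat_help or intro_craypat(1) on your system:

    /* Sketch: skip initialization, time-region the solver loop, then flush.
     * Header name and PAT_STATE_* constants are assumptions. */
    #include <pat_api.h>

    void solver_step(void);

    void run(int nsteps)
    {
        PAT_record(PAT_STATE_OFF);          /* exclude initialization */
        /* ... expensive setup not worth profiling ... */
        PAT_record(PAT_STATE_ON);

        PAT_region_begin(1, "solver_loop"); /* label appears in pat_report */
        for (int i = 0; i < nsteps; i++)
            solver_step();
        PAT_region_end(1);

        PAT_flush_buffer();                 /* signature as listed above */
    }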
  • 146.
Instrument the application for further analysis (a.out+apa)
   % pat_build -O <apafile>.apa
Run the application
   % aprun ... a.out+apa     (or qsub <apa script>)
Generate a text report and visualization file (.ap2)
   % pat_report -o my_text_report.txt [<datafile>.xf | <datadir>]
View the report in text and/or with Cray Apprentice2
   % app2 <datafile>.ap2
  • 147.
MUST run on Lustre (/work/..., /lus/..., /scratch/..., etc.)
Number of files used to store raw data:
   1 file created for a program with 1 - 256 processes
   √n files created for a program with 257 - n processes
   Ability to customize with PAT_RT_EXPFILE_MAX
  • 148.
Full trace files show transient events but are too large
Current run-time summarization misses transient events
Plan to add the ability to record:
   Top N peak values (N small)
   Approximate standard deviation over time
   For time, memory traffic, etc.
   During both tracing and sampling
  • 149.
Cray Apprentice2 is targeted to help identify and correct:
   Load imbalance
   Excessive communication
   Network contention
   Excessive serialization
   I/O problems
Views include:
   Call graph profile
   Communication statistics
   Time-line view (communication and I/O)
   Activity view
   Pair-wise communication statistics
   Text reports
   Source code mapping
  • 150.
[Screenshot: Cray Apprentice2 overview display; switch the overview display here.]
  • 151.
  • 152.
  • 153.
  • 154.
[Screenshot: min, avg, and max values with -1/+1 standard deviation marks.]
  • 155.
[Screenshot: Cray Apprentice2 call-tree view, annotated:]
   Width = inclusive time; height = exclusive time
   Filtered nodes or sub-trees are marked
   Load balance overview: height = max time, middle bar = average time, lower bar = min time; yellow represents imbalance time
   DUH button: provides hints for performance tuning
   Function list; zoom
  • 156.
[Screenshot: call-tree view menus, annotated:]
   Right mouse click on a node: node menu (e.g., hide/unhide children)
   Right mouse click: view menu (e.g., filter)
   Sort options: % Time, Time, Imbalance %, Imbalance time
   Function list off
  • 157.
  • 158.
  • 159.
  • 160.
  • 161.
[Screenshot: min, avg, and max values with -1/+1 standard deviation marks.]
  • 162.
  • 163.
Cray Apprentice2 panel help
pat_help: interactive help on the Cray performance toolset
FAQ available through pat_help
  • 164.
intro_craypat(1)
   Introduces the CrayPat performance tool
pat_build
   Instrument a program for performance analysis
pat_help
   Interactive online help utility
pat_report
   Generate a performance report, both in text and for use with the GUI
hwpc(3)
   Describes the predefined hardware performance counter groups
papi_counters(5)
   Lists PAPI event counters
   Use the papi_avail or papi_native_avail utilities to get the list of events when running on a specific architecture
  • 165.
pat_report: Help for -O option:

Available option values are in the left column; a prefix can be specified:

  ct                   -O calltree
  defaults             Tables that would appear by default.
  heap                 -O heap_program,heap_hiwater,heap_leaks
  io                   -O read_stats,write_stats
  lb                   -O load_balance
  load_balance         -O lb_program,lb_group,lb_function
  mpi                  -O mpi_callers
  ---
  callers              Profile by Function and Callers
  callers+hwpc         Profile by Function and Callers
  callers+src          Profile by Function and Callers, with Line Numbers
  callers+src+hwpc     Profile by Function and Callers, with Line Numbers
  calltree             Function Calltree View
  calltree+hwpc        Function Calltree View
  calltree+src         Calltree View with Callsite Line Numbers
  calltree+src+hwpc    Calltree View with Callsite Line Numbers
  ...
  • 166.
Interactive by default, or use a trailing '.' to just print a topic
New FAQ as of CrayPat 5.0.0
Has counter and counter-group information
   % pat_help counters amd_fam10h groups .
  • 167.
The top level CrayPat/X help topics are listed below.
A good place to start is:

    overview

If a topic has subtopics, they are displayed under the heading
"Additional topics", as below. To view a subtopic, you need only
enter as many initial letters as required to distinguish it from
other items in the list. To see a table of contents including
subtopics of those subtopics, etc., enter:

    toc

To produce the full text corresponding to the table of contents,
specify "all", but preferably in a non-interactive invocation:

    pat_help all . > all_pat_help
    pat_help report all . > all_report_help

Additional topics:

    API            execute
    balance        experiment
    build          first_example
    counters       overview
    demos          report
    environment    run

pat_help (.=quit ,=back ^=up /=top ~=search) =>
  • 168.
  • 169.
ATP (Abnormal Termination Processing), or: what do you do when task a causes task b to crash?
   Load the ATP module before compiling
   Set ATP_ENABLED before running
Limitations
   ATP disables core dumping; when ATP is running, an application crash does not produce a core dump
   When ATP is running, the application cannot be checkpointed
   ATP does not support threaded application processes
   ATP has been tested at 10,000 cores; behavior at core counts greater than 10,000 is still being researched
  • 170.
Application 926912 is crashing. ATP analysis proceeding...

Stack walkback for Rank 3 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:220
  main@testMPIApp.c:83
  foo@testMPIApp.c:47
  raise@pt-raise.c:42
Stack walkback for Rank 3 done
Process died with signal 4: 'Illegal instruction'
View application merged backtrace tree file 'atpMergedBT.dot' with 'statview'
You may need to 'module load stat'.
  • 171.
  • 172.
  • 173.
What CCM is NOT
   It is NOT a virtual machine or any OS within an OS
   It is NOT an emulator
  • 174.
What is CCM then?
   Provides the runtime environment on compute nodes expected by ISV applications
   Dynamically allocates and configures compute nodes at job start
   Nodes are not permanently dedicated to CCM
      Any compute node can be used
      Allocated like any other batch job (on demand)
   MPI and third-party MPI run over TCP/IP using the high-speed network
   Supports standard services: ssh, rsh, nscd, ldap
   Complete root file system on the compute nodes
   Built on top of the Dynamic Shared Libraries (DSL) environment
   Apps run under CCM: Abaqus, Matlab, Castep, Discoverer, DMol3, Mesodyn, Ensight, and more
Under CCM, everything the application can "see" looks like a standard Linux cluster: Linux OS, x86 processor, and MPI
  • 175.
[Diagram: Cray XT6/XE6 system with compute nodes running in ESM mode, compute nodes running in CCM mode, idle nodes, and service nodes.]

• Many applications running in Extreme Scalability Mode (ESM)
• Submit a CCM application through the batch scheduler; nodes are reserved:
     qsub -l ccm=1 Qname AppScript
• Previous jobs finish; nodes are configured for CCM
• The batch script and application execute
• Other nodes are scheduled for ESM or CCM applications as available
• After the CCM job completes, the CCM nodes are cleared
• The CCM nodes become available for ESM or CCM applications
  • 176.
Support MPIs that are configured to work with the OFED stack
   CCM1 supports ISV applications over TCP/IP only
   CCM2 supports ISV applications over TCP/IP and Gemini on XE6
ISV Application Acceleration (IAA) directly utilizes the HSN through the Gemini user-space APIs
   The goal of IAA/CCM2 is to deliver latency and bandwidth improvements over CCM1 over TCP/IP
   The CCM2 infrastructure is currently in system test
   The IAA design and implementation phase is complete
   CCM2 with IAA is currently in the integration test phase
  • 177.
A code binary compiled for SLES and an Opteron (DSOs are OK)
A third-party MPI library that can use TCP/IP
   We have tried OpenMPI, HP-MPI, and LAM-MPI
   Most of the bigger apps are packaged with their own library (usually HP-MPI)
Add ccmrun to the run script
The IP address of the license server for the application
   Note that right now CCM cannot do an NSLOOKUP
   LMHOSTS must be specified by IP address
With CLE 4.0: an MPI library that uses IBVERBS
  • 178.
CCMRUN: analogous to aprun; runs a third-party batch job
   In most cases, if you already have a run script for your third-party app, adding ccmrun before the application command will set it up
CCMLOGIN: allows interactive access to the head node of an allocated compute pool; takes optional ssh options
   CCM uses ssh known_hosts to set up passwordless ssh between a set of compute nodes. You can go to allocated nodes but no further.
  • 179.
[Diagram: XE6 system layout]
   External login servers
   Internal login (PBS) nodes
   Compute nodes
   Boot RAID, 10 GbE, and IB QDR connections
  • 180.
External login nodes: Dell 4-socket servers through which the user enters the system
PBS nodes: internal single-socket 6-core nodes that run the PBS MOMs
   aprun must be issued from a node on the system database
Compute nodes: 2-socket 8-core Opteron nodes that run a trimmed-down OS (still Linux)
  • 181.
news: diskuse_work diskuse_home system_info.txt

aminga@garnet01:~> uname -a
Linux garnet01 2.6.27.48-0.12-default #1 SMP 2010-09-20 11:03:26 -0400 x86_64 x86_64 x86_64 GNU/Linux

aminga@garnet01:~> qsub -I -lccm=1 -q debug -l walltime=01:00:00 -l ncpus=32 -A ERDCS97290STA
qsub: waiting for job 104868.sdb to start
qsub: job 104868.sdb ready

In CCM JOB: 104868.sdb  JID sdb  USER aminga  GROUP erdcssta
Initializing CCM environment, Please Wait
CCM Start success, 2 of 2 responses

aminga@garnet13:~> uname -a
Linux garnet13 2.6.27.48-0.12.1_1.0301.5737-cray_gem_s #1 SMP Mon Mar 28 22:20:59 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
  • 182.
aminga@garnet13:~> cat $PBS_NODEFILE
nid00972
nid00972
nid00972
<snip>
nid01309
nid01309
nid01309

aminga@garnet13:~> ccmlogin
Last login: Mon Jun 13 13:03:26 2011 from nid01028
aminga@nid00972:~> uname -a
Linux nid00972 2.6.27.48-0.12.1_1.0301.5737-cray_gem_c #1 SMP Mon Mar 28 22:26:26 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

aminga@nid00972:~> ssh nid01309
aminga@nid01309:~> uname -a
Linux nid01309 2.6.27.48-0.12.1_1.0301.5737-cray_gem_c #1 SMP Mon Mar 28 22:26:26 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

aminga@nid00972:~> ssh nid01310
Redirecting to /etc/ssh/ssh_config
ssh: connect to host nid01310 port ...: Connection refused
  • 183.
#!/bin/csh
#PBS -l mppwidth=2
#PBS -l mppnppn=1
#PBS -q ccm_queue
#PBS -j oe
cd $PBS_O_WORKDIR
perl ConstructMachines.LINUX.pl
setenv DSD_MachineLIST $PBS_O_WORKDIR/machines.LINUX
setenv MPI_COMMAND "/usr/local/applic/accelrys/MSModeling5.5/hpmpi/opt/hpmpi/bin/mpirun -np "
ccmrun ./RunDiscover.sh -np 2 nvt_m
  • 184.
#PBS -l mppwidth=2
#PBS -l mppnppn=1
#PBS -j oe
#PBS -N gauss-test-ccm
#PBS -q ccm_queue
cd $PBS_O_WORKDIR
cp $PBS_NODEFILE node_file
./CreatDefaultRoute.pl
mkdir -p scratch
setenv DVS_CACHE off
setenv g09root /usr/local/applic/gaussian/
setenv GAUSS_EXEDIR ${g09root}/g09
setenv GAUSS_EXEDIR ${g09root}/g09/linda-exe:$GAUSS_EXEDIR
setenv GAUSS_SCRDIR `pwd`
setenv TMPDIR `pwd`
source ${g09root}/g09/bsd/g09.login
setenv GAUSS_LFLAGS "-vv -nodefile node_file -opt Tsnet.Node.lindarsharg:ssh"
setenv LINDA_PATH ${g09root}/g09/linda8.2/opteron-linux
set LINDA_LAUNCHVERBOSE=1
ccmrun ${g09root}/g09/g09 < gauss-test-ccm.com
setenv TEND `echo "print time();" | perl`
echo "Gaussian CCM walltime: `expr $TEND - $TBEGIN` seconds"
  • 185.
cd $PBS_O_WORKDIR
/bin/rm -rf bhost.def
cat $PBS_NODEFILE > bhost.def
/bin/rm -rf job.script
cat > job.script << EOD
#!/bin/csh
set echo
cd $PWD
setenv AEROSOFT_HOME /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft
setenv LAMHOME /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft
setenv PATH /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft/bin:$PATH
setenv TMPDIR /work/aminga
ln -s /usr/lib64/libpng.so libpng.so.3
setenv LD_LIBRARY_PATH `pwd`:$LD_LIBRARY_PATH
setenv LAMRSH "ssh -x"
lamboot bhost.def
time mpirun -np 2 -x LD_LIBRARY_PATH gasp --mpi -i duct.xml --run 2 --elmhost 140.31.9.44
EOD
chmod +x job.script
ccmrun job.script
  • 186.
#!/bin/sh
#PBS -q ccm_queue
#PBS -l mppwidth=48
#PBS -j oe
#PBS -N CFX
cd $PBS_O_WORKDIR
TOP_DIR=/usr/local/applic/ansys
export ANSYSLIC_DIR=$TOP_DIR/shared_files/licensing
export LD_LIBRARY_PATH=$TOP_DIR/v121/CFX/tools/hpmpi-2.3/Linux-amd64/lib/linux_amd64:$LD_LIBRARY_PATH
export PATH=$TOP_DIR/v121/CFX/bin:$PATH
export CFX5RSH=ssh
export MPIRUN_OPTIONS="-TCP -prot -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23"
/bin/rm -rf host.list
cat $PBS_NODEFILE > host.list
export proc_list=`sort host.list | uniq -c | awk '{ printf("%s*%s ", $2, $1) ; }'`
echo $proc_list
which cfx5solve
ccmrun cfx5solve -def S*400k.def -par-dist "$proc_list" -start-method "HP MPI Distributed Parallel"
rm -f host.list
  • 187.
#!/bin/bash
#PBS -l mppwidth=16
#PBS -q ccm_queue
#PBS -j oe
#PBS -N abaqus_e1
cd $PBS_O_WORKDIR
TMPDIR=.
ABAQUS=/usr/local/applic/abaqus
#cp ${ABAQUS}/input/e1.inp e1.inp
cat $PBS_NODEFILE
echo "Run Abaqus"
ccmrun ${ABAQUS}/6.10-1/exec/abq6101.exe input=e1.inp job=e1 cpus=16 interactive
  • 188.
#!/bin/csh
#PBS -q ccm_queue
#PBS -l mppwidth=32
#PBS -j oe
#PBS -N AFRL_Fluent
cd $PBS_O_WORKDIR
setenv FLUENT_HOME /usr/local/applic/fluent/12.1/fluent
setenv FLUENT_ARCH lnamd64
setenv PATH /usr/local/applic/fluent/12.1/v121/fluent/bin:$PATH
setenv FLUENT_INC /usr/local/applic/fluent/12.1/v121/fluent
###setenv LM_LICENSE_FILE 7241@10.128.0.72
setenv LM_LICENSE_FILE 27000@10.128.0.76
setenv ANSYSLMD_LICENSE_FILE /home/applic/ansys/shared_files/licensing/license.dat
echo ${LM_LICENSE_FILE}
setenv FLUENT_VERSION -r12.1.1
cd $PBS_O_WORKDIR
rm -rf host.list
cat $PBS_NODEFILE > host.list
module load ccm dot
setenv MALLOC_MMAP_MAX_ 0
setenv MALLOC_TRIM_THRESHOLD_ 536870912
setenv MPIRUN_OPTIONS " -TCP -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7"
setenv MPIRUN_OPTIONS "${MPIRUN_OPTIONS},8,9,10,11,12,13,14,15 "
setenv MPI_SOCKBUFSIZE 524288
setenv MPI_WORKDIR $PWD
setenv MPI_COMMD 1024,1024
ccmrun /usr/local/applic/fluent/v121/fluent/bin/fluent -r12.1.2 2ddp -mpi=hp -gu -driver null -t4 -i blast.inp > tstfluent-blast.jobout
  • 189.
ALPS allows you to run one aprun instance per node; using CCM you can get around that.
Suppose you want to run 16 single-core jobs and use only one node:

   qsub -lccm=1 -q debug -l walltime=01:00:00 -l ncpus=16 -A ERDCS97290STA

   #PBS -j oe
   cd $PBS_O_WORKDIR
   ./myapp&
   ./myapp&
   ./myapp&
   ./myapp&
   ./myapp&
   ./myapp&
  • 190.
Engineering for Multi-level Parallelism
  • 191.
Flat, all-MPI parallelism is beginning to be too limited as the number of compute cores rapidly increases
It is becoming necessary to design applications with multiple levels of parallelism:
   High-level MPI parallelism between nodes
      You're probably already doing this
   Loose, on-node parallelism via threads at a high level
      Most codes today are using MPI, but threading is becoming more important
   Tight, on-node, vector parallelism at a low level
      SSE/AVX on CPUs
      GPU threaded parallelism
Programmers need to expose the same parallelism for all future architectures (see the sketch below).
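A minimal sketch of how the three levels can appear together in code. This is illustrative only, not from S3D; all names are invented, and the MPI level (one rank per node-sized block, with halo exchange) is only indicated in the comments:

    /* Level 1: each MPI rank owns one block of the global grid
     * (halo exchange with neighbor ranks omitted for brevity). */
    void update(double *u, const double *rhs, int n, double dt)
    {
        /* Level 2: loose, high-level OpenMP threading across the node */
        #pragma omp parallel for schedule(static)
        for (int j = 0; j < n; j++) {
            /* Level 3: a stride-1 inner loop the compiler can turn
             * into SSE/AVX vector instructions */
            for (int i = 0; i < n; i++)
                u[(long)j*n + i] += dt * rhs[(long)j*n + i];
        }
    }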
  • 192.
A benchmark problem was defined to closely resemble the target simulation
   52-species n-heptane chemistry and 48^3 grid points per node
   48^3 * 18,500 nodes = 2 billion grid points
   The target problem would take two months on today's Jaguar
The code was benchmarked and profiled on a dual-hexcore XT5
Several kernels were identified and extracted from core S3D into stand-alone driver programs: mini-apps!
  • 193.
    Goals: Convert S3D to a hybrid multi-core application suited for a multi-core node with or without an accelerator.  Hoisted several loops up the call tree  Introduced high-level OpenMP  Be able to perform the computation entirely on the accelerator if available. - Arrays and data able to reside entirely on the accelerator. - Data sent from accelerator to host CPU for halo communication, I/O and monitoring only. Strategy:  To program using both hand-written and generated code. - Hand-written and tuned CUDA*. - Automated Fortran and CUDA generation for chemistry kernels - Automated code generation through compiler directives  S3D kernels are now a part of Cray’s compiler development test cases * Note: CUDA refers to CUDA-Fortran, unless mentioned otherwise 2011 HPCMP User Group © Cray Inc. June 20, 2011 201
  • 194.
RHS - called 6 times for each time step (Runge-Kutta iterations)
   Calculate primary variables - point-wise mesh loops within 5 different routines
   Perform derivative computation - high-order differencing; halos 5 zones thick
   Calculate diffusion - 3 different routines with some derivative computation
   Perform derivative computation for forming the RHS - lots of communication
   Perform point-wise chemistry computation

All major loops are at a low level of the call tree.
(In the original slide, green marks the major point-wise computation and yellow marks the other major computation.)
  • 195.
RHS - called 6 times for each time step (Runge-Kutta iterations)
   Calculate primary variables - point-wise mesh loops within 3 different routines [OMP loop over grid]
   Perform derivative computation - high-order differencing [overlapped]
   Calculate primary variables - point-wise mesh loops within 2 different routines [OMP loop over grid]
   Calculate diffusion - 3 different routines with some derivative computation
   Perform derivative computation [overlapped]
   Perform point-wise chemistry computation (1) [OMP loop over grid]
   Perform derivative computation for forming the RHS - lots of communication [overlapped]
   Perform point-wise chemistry computation (2) [OMP loop over grid]
  • 196.
  • 197.
Create a good-granularity OpenMP loop
   Improves cache reuse
   Reduces memory usage significantly
   Creates a good potential kernel for an accelerator
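A hedged illustration of the granularity point, with invented routine names (S3D itself is Fortran; this C sketch only shows the shape): one coarse OpenMP loop hoisted above the per-block routines, rather than many small parallel regions inside them.

    void primary_vars(int blk);
    void diffusion(int blk);
    void chemistry(int blk);

    void rhs(int nblocks)
    {
        /* One parallel region, one coarse loop: each thread works on the
         * same block through all phases, which helps cache reuse and
         * yields a large kernel suitable for an accelerator. */
        #pragma omp parallel for schedule(static)
        for (int blk = 0; blk < nblocks; blk++) {
            primary_vars(blk);
            diffusion(blk);
            chemistry(blk);
        }
    }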
  • 198.
CPU Optimizations
Optimizing Communication
I/O Best Practices
  • 199.
  • 200.
    55.  1                 ii = 0
    56.  1 2-----------<   do b = abmin, abmax
    57.  1 2 3---------<     do j = ijmin, ijmax
    58.  1 2 3                 ii = ii+1
    59.  1 2 3                 jj = 0
    60.  1 2 3 4-------<       do a = abmin, abmax
    61.  1 2 3 4 r8----<         do i = ijmin, ijmax
    62.  1 2 3 4 r8                jj = jj+1
    63.  1 2 3 4 r8                f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
    64.  1 2 3 4 r8                f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
    65.  1 2 3 4 r8                f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
    66.  1 2 3 4 r8                f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
    67.  1 2 3 4 r8----> end do
    68.  1 2 3 4-------> end do
    69.  1 2 3---------> end do
    70.  1 2-----------> end do

Poor loop order results in poor striding:
   The inner-most loop strides on a slow dimension of each array.
   The best the compiler can do is unroll.
   Little to no cache reuse.
  • 201.
USER / #1.OriginalLoops
-----------------------------------------------------------------
  Time%                                        55.0%
  Time                                     13.938244 secs
  Imb.Time                                  0.075369 secs
  Imb.Time%                                      0.6%
  Calls                       0.1 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED   11.858M/sec    165279602 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                      11.931M/sec    166291054 fills
  PAPI_L1_DCM                23.499M/sec    327533338 misses
  PAPI_L1_DCA                34.635M/sec    482751044 refs
  User time (approx)         13.938 secs  36239439807 cycles  100.0%Time
  Average Time per Call                    13.938244 sec
  CrayPat Overhead : Time          0.0%
  D1 cache hit,miss ratios      32.2% hits       67.8% misses
  D2 cache hit,miss ratio       49.8% hits       50.2% misses
  D1+D2 cache hit,miss ratio    66.0% hits       34.0% misses

Poor loop order results in poor cache reuse: for every L1 cache hit there are
two misses, and overall only 2/3 of all references were found in level 1 or
level 2 cache.
  • 202.
  • 203.
  • 204.
    75.  1 2-----------<   do i = ijmin, ijmax
    76.  1 2                 jj = 0
    77.  1 2 3---------<     do a = abmin, abmax
    78.  1 2 3 4-------<       do j = ijmin, ijmax
    79.  1 2 3 4                 jj = jj+1
    80.  1 2 3 4                 ii = 0
    81.  1 2 3 4 Vcr2--<         do b = abmin, abmax
    82.  1 2 3 4 Vcr2              ii = ii+1
    83.  1 2 3 4 Vcr2              f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
    84.  1 2 3 4 Vcr2              f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
    85.  1 2 3 4 Vcr2              f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
    86.  1 2 3 4 Vcr2              f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
    87.  1 2 3 4 Vcr2--> end do
    88.  1 2 3 4-------> end do
    89.  1 2 3---------> end do
    90.  1 2-----------> end do

Reordered loop nest:
   Now the inner-most loop is stride-1 on both arrays.
   Memory accesses now happen along the cache line, allowing reuse.
   The compiler is able to vectorize and make better use of SSE instructions.
  • 205.
USER / #2.ReorderedLoops
-----------------------------------------------------------------
  Time%                                        31.4%
  Time                                      7.955379 secs
  Imb.Time                                  0.260492 secs
  Imb.Time%                                      3.8%
  Calls                       0.1 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED    0.419M/sec      3331289 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                      15.285M/sec    121598284 fills
  PAPI_L1_DCM                13.330M/sec    106046801 misses
  PAPI_L1_DCA                66.226M/sec    526855581 refs
  User time (approx)          7.955 secs  20684020425 cycles  100.0%Time
  Average Time per Call                     7.955379 sec
  CrayPat Overhead : Time          0.0%
  D1 cache hit,miss ratios      79.9% hits       20.1% misses
  D2 cache hit,miss ratio        2.7% hits       97.3% misses
  D1+D2 cache hit,miss ratio    80.4% hits       19.6% misses

Improved striding greatly improved cache reuse: the runtime was cut nearly in
half. Still, some 20% of all references are cache misses.
  • 206.
First loop, partially vectorized and unrolled by 4:

     95.  1                 ii = 0
     96.  1 2-----------<   do j = ijmin, ijmax
     97.  1 2 i---------<     do b = abmin, abmax
     98.  1 2 i                 ii = ii+1
     99.  1 2 i                 jj = 0
    100.  1 2 i i-------<       do i = ijmin, ijmax
    101.  1 2 i i Vpr4--<         do a = abmin, abmax
    102.  1 2 i i Vpr4              jj = jj+1
    103.  1 2 i i Vpr4              f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
    104.  1 2 i i Vpr4              f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
    105.  1 2 i i Vpr4--> end do
    106.  1 2 i i-------> end do
    107.  1 2 i---------> end do
    108.  1 2-----------> end do

Second loop, vectorized and unrolled by 4:

    109.  1                 jj = 0
    110.  1 2-----------<   do i = ijmin, ijmax
    111.  1 2 3---------<     do a = abmin, abmax
    112.  1 2 3                 jj = jj+1
    113.  1 2 3                 ii = 0
    114.  1 2 3 4-------<       do j = ijmin, ijmax
    115.  1 2 3 4 Vr4---<         do b = abmin, abmax
    116.  1 2 3 4 Vr4               ii = ii+1
    117.  1 2 3 4 Vr4               f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
    118.  1 2 3 4 Vr4               f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
    119.  1 2 3 4 Vr4---> end do
    120.  1 2 3 4-------> end do
    121.  1 2 3---------> end do
    122.  1 2-----------> end do
  • 207.
USER / #3.FissionedLoops
-----------------------------------------------------------------
  Time%                                         9.8%
  Time                                      2.481636 secs
  Imb.Time                                  0.045475 secs
  Imb.Time%                                      2.1%
  Calls                       0.4 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED    1.175M/sec      2916610 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                      34.109M/sec     84646518 fills
  PAPI_L1_DCM                26.424M/sec     65575972 misses
  PAPI_L1_DCA               156.705M/sec    388885686 refs
  User time (approx)          2.482 secs   6452279320 cycles  100.0%Time
  Average Time per Call                     2.481636 sec
  CrayPat Overhead : Time          0.0%
  D1 cache hit,miss ratios      83.1% hits       16.9% misses
  D2 cache hit,miss ratio        3.3% hits       96.7% misses
  D1+D2 cache hit,miss ratio    83.7% hits       16.3% misses

Fissioning further improved cache reuse and resulted in better vectorization:
the runtime was further reduced, the cache hit/miss ratio improved slightly,
and the loopmark file shows better vectorization of the fissioned loops.
  • 208.
  • 209.
A triple-nested loop at a high level; IFs inside the inner loop can significantly reduce the chances of vectorization.

    (  52)  C     THE ORIGINAL
    (  53)
    (  54)        DO 47020 J = 1, JMAX
    (  55)        DO 47020 K = 1, KMAX
    (  56)        DO 47020 I = 1, IMAX
    (  57)        JP = J + 1
    (  58)        JR = J - 1
    (  59)        KP = K + 1
    (  60)        KR = K - 1
    (  61)        IP = I + 1
    (  62)        IR = I - 1
    (  63)        IF (J .EQ. 1)    GO TO 50
    (  64)        IF (J .EQ. JMAX) GO TO 51
    (  65)        XJ = ( A(I,JP,K) - A(I,JR,K) ) * DA2
    (  66)        YJ = ( B(I,JP,K) - B(I,JR,K) ) * DA2
    (  67)        ZJ = ( C(I,JP,K) - C(I,JR,K) ) * DA2
    (  68)        GO TO 70
    (  69)   50   J1 = J + 1
    (  70)        J2 = J + 2
    (  71)        XJ = (-3. * A(I,J,K) + 4. * A(I,J1,K) - A(I,J2,K) ) * DA2
    (  72)        YJ = (-3. * B(I,J,K) + 4. * B(I,J1,K) - B(I,J2,K) ) * DA2
    (  73)        ZJ = (-3. * C(I,J,K) + 4. * C(I,J1,K) - C(I,J2,K) ) * DA2
    (  74)        GO TO 70
    (  75)   51   J1 = J - 1
    (  76)        J2 = J - 2
    (  77)        XJ = ( 3. * A(I,J,K) - 4. * A(I,J1,K) + A(I,J2,K) ) * DA2
    (  78)        YJ = ( 3. * B(I,J,K) - 4. * B(I,J1,K) + B(I,J2,K) ) * DA2
    (  79)        ZJ = ( 3. * C(I,J,K) - 4. * C(I,J1,K) + C(I,J2,K) ) * DA2
    (  80)   70   CONTINUE
    (  81)        IF (K .EQ. 1)    GO TO 52
    (  82)        IF (K .EQ. KMAX) GO TO 53
    (  83)        XK = ( A(I,J,KP) - A(I,J,KR) ) * DB2
    (  84)        YK = ( B(I,J,KP) - B(I,J,KR) ) * DB2
    (  85)        ZK = ( C(I,J,KP) - C(I,J,KR) ) * DB2
    (  86)        GO TO 71

continues...
  • 210.
PGI:
   55, Invariant if transformation
       Loop not vectorized: loop count too small
   56, Invariant if transformation
  • 211.
The stride-1 loop is brought inside the IF statements.

    ( 141)  C     THE RESTRUCTURED
    ( 142)
    ( 143)        DO 47029 J = 1, JMAX
    ( 144)        DO 47029 K = 1, KMAX
    ( 145)
    ( 146)        IF (J .EQ. 1) THEN
    ( 147)
    ( 148)        J1 = 2
    ( 149)        J2 = 3
    ( 150)        DO 47021 I = 1, IMAX
    ( 151)        VAJ(I) = (-3. * A(I,J,K) + 4. * A(I,J1,K) - A(I,J2,K) ) * DA2
    ( 152)        VBJ(I) = (-3. * B(I,J,K) + 4. * B(I,J1,K) - B(I,J2,K) ) * DA2
    ( 153)        VCJ(I) = (-3. * C(I,J,K) + 4. * C(I,J1,K) - C(I,J2,K) ) * DA2
    ( 154)  47021 CONTINUE
    ( 155)
    ( 156)        ELSE IF (J .NE. JMAX) THEN
    ( 157)
    ( 158)        JP = J+1
    ( 159)        JR = J-1
    ( 160)        DO 47022 I = 1, IMAX
    ( 161)        VAJ(I) = ( A(I,JP,K) - A(I,JR,K) ) * DA2
    ( 162)        VBJ(I) = ( B(I,JP,K) - B(I,JR,K) ) * DA2
    ( 163)        VCJ(I) = ( C(I,JP,K) - C(I,JR,K) ) * DA2
    ( 164)  47022 CONTINUE
    ( 165)
    ( 166)        ELSE
    ( 167)
    ( 168)        J1 = JMAX-1
    ( 169)        J2 = JMAX-2
    ( 170)        DO 47023 I = 1, IMAX
    ( 171)        VAJ(I) = ( 3. * A(I,J,K) - 4. * A(I,J1,K) + A(I,J2,K) ) * DA2
    ( 172)        VBJ(I) = ( 3. * B(I,J,K) - 4. * B(I,J1,K) + B(I,J2,K) ) * DA2
    ( 173)        VCJ(I) = ( 3. * C(I,J,K) - 4. * C(I,J1,K) + C(I,J2,K) ) * DA2
    ( 174)  47023 CONTINUE
    ( 175)
    ( 176)        ENDIF

continues...
  • 212.
PGI:
   144, Invariant if transformation
        Loop not vectorized: loop count too small
   150, Generated 3 alternate loops for the inner loop
        Generated vector sse code for inner loop
        Generated 8 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 8 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 8 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 8 prefetch instructions for this loop
   160, Generated 4 alternate loops for the inner loop
        Generated vector sse code for inner loop
        Generated 6 prefetch instructions for this loop
        Generated vector sse code for inner loop
        ...
  • 213.
[Chart] MFLOPS (0-2500) versus vector length (0-500) for four variants: CCE-Original Fortran, CCE-Restructured Fortran, PGI-Original Fortran, PGI-Restructured Fortran.
  • 214.
Max vector length doubled to 256 bits
Much cleaner instruction set
   The result register is distinct from the source registers
   The old SSE instruction set always destroyed a source register
Floating-point multiply-accumulate
   A(1:4) = B(1:4)*C(1:4) + D(1:4)   ! now one instruction
The next generation of both AMD and Intel processors will have AVX
Vectors are becoming more important, not less
 Cache blocking is a combination of strip mining and loop interchange, designed to increase data reuse
    Takes advantage of temporal reuse: re-reference array elements already referenced
    Good blocking will also take advantage of spatial reuse: work with the cache lines!
 Many ways to block any given loop nest
    Which loops get blocked?
    What block size(s) to use?
    Analysis can reveal which ways are beneficial
    But trial-and-error is probably faster
 2D Laplacian example:

      do j = 1, 8
        do i = 1, 16
          a = u(i-1,j) + u(i+1,j) &
              - 4*u(i,j)          &
              + u(i,j-1) + u(i,j+1)
        end do
      end do

 Cache structure for this example:
    Each cache line holds 4 array elements
    The cache can hold 12 lines of u data
 There is no cache reuse between outer-loop iterations

   [Figure: 16 x 8 grid of u(i,j), i = 1..16 down and j = 1..8 across, showing how cache lines map onto the array; the unblocked loop takes 120 cache misses.]
 Unblocked loop: 120 cache misses
 Block the inner loop:

      do IBLOCK = 1, 16, 4
        do j = 1, 8
          do i = IBLOCK, IBLOCK + 3
            a(i,j) = u(i-1,j) + u(i+1,j) &
                     - 2*u(i,j)          &
                     + u(i,j-1) + u(i,j+1)
          end do
        end do
      end do

 Now we have reuse of the "j+1" data

   [Figure: the same 16 x 8 grid traversed in 4-row strips (i = 1, 5, 9, 13); the miss count drops to 80.]
 One-dimensional blocking reduced misses from 120 to 80
 Iterate over 4 x 4 blocks:

      do JBLOCK = 1, 8, 4
        do IBLOCK = 1, 16, 4
          do j = JBLOCK, JBLOCK + 3
            do i = IBLOCK, IBLOCK + 3
              a(i,j) = u(i-1,j) + u(i+1,j) &
                       - 2*u(i,j)          &
                       + u(i,j-1) + u(i,j+1)
            end do
          end do
        end do
      end do

 Better use of spatial locality (cache lines)

   [Figure: the grid traversed in 4 x 4 blocks (j = 1..5 across, i = 1..13 down); the miss count drops to 60.]
 Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
    Operations can be arranged to create multiple levels of blocking:
       Block for registers
       Block for cache (L1, L2, L3)
       Block for TLB
 No further discussion here beyond the single-level sketch below. Interested readers can see
    Any book on code optimization
       Sun's "Techniques for Optimizing Applications: High Performance Computing" contains a decent introductory discussion in Chapter 8
       Insert your favorite book here
    Gunnels, Henry, and van de Geijn. June 2001. High-Performance Matrix Multiplication Algorithms for Architectures with Hierarchical Memories. FLAME Working Note #4, TR-2001-22, The University of Texas at Austin, Department of Computer Sciences
       Develops algorithms and cost models for GEMM in hierarchical memories
    Goto and van de Geijn. 2008. Anatomy of High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software 34, 3 (May), 1-25
       Describes the GotoBLAS DGEMM
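To make the idea concrete, here is a minimal single-level cache-blocked GEMM sketch in C. It is an illustration only, not the register/TLB-blocked algorithms developed in the references above; the matrix order N and block size BS are assumed values that would need tuning for a real cache hierarchy.

    #include <stddef.h>

    #define N  1024   /* assumed matrix order: C[i][j] += A[i][k] * B[k][j] */
    #define BS 64     /* assumed cache block size; tune per target cache level */

    static int imin(int a, int b) { return a < b ? a : b; }

    /* One level of cache blocking for C = C + A*B, row-major square matrices. */
    void dgemm_blocked(const double *a, const double *b, double *c)
    {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    /* Work on BS x BS blocks that (ideally) stay resident in cache. */
                    for (int i = ii; i < imin(ii + BS, N); i++)
                        for (int k = kk; k < imin(kk + BS, N); k++) {
                            double aik = a[(size_t)i * N + k];
                            for (int j = jj; j < imin(jj + BS, N); j++)
                                c[(size_t)i * N + j] += aik * b[(size_t)k * N + j];
                        }
    }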
 "I tried cache-blocking my code, but it didn't help"
    You're doing it wrong
       Your block size is too small (too much loop overhead)
       Your block size is too big (data is falling out of cache)
       You're targeting the wrong cache level (?)
       You haven't selected the correct subset of loops to block
    The compiler is already blocking that loop
    Prefetching is acting to minimize cache misses
    Computational intensity within the loop nest is very large, making blocking less important
 Multigrid PDE solver
    Class D, 64 MPI ranks
    Global grid is 1024 x 1024 x 1024
    Local grid is 258 x 258 x 258
 Two similar loop nests account for >50% of run time
 27-point 3D stencil:

      do i3 = 2, 257
        do i2 = 2, 257
          do i1 = 2, 257
            ! update u(i1,i2,i3)
            ! using 27-point stencil
          end do
        end do
      end do

 There is good data reuse along the leading dimension, even without blocking

   [Figure: the 27-point stencil spanning i1-1..i1+1, i2-1..i2+1, i3-1..i3+1, with i1 along the cache lines.]
 Block the inner two loops
    Creates blocks extending along the i3 direction

      do I2BLOCK = 2, 257, BS2
        do I1BLOCK = 2, 257, BS1
          do i3 = 2, 257
            do i2 = I2BLOCK, min(I2BLOCK+BS2-1, 257)
              do i1 = I1BLOCK, min(I1BLOCK+BS1-1, 257)
                ! update u(i1,i2,i3)
                ! using 27-point stencil
              end do
            end do
          end do
        end do
      end do

      Block size    Mop/s/process
      unblocked            531.50
      16 x 16              279.89
      22 x 22              321.26
      28 x 28              358.96
      34 x 34              385.33
      40 x 40              408.53
      46 x 46              443.94
      52 x 52              468.58
      58 x 58              470.32
      64 x 64              512.03
      70 x 70              506.92
 Block the outer two loops
    Preserves spatial locality along the i1 direction

      do I3BLOCK = 2, 257, BS3
        do I2BLOCK = 2, 257, BS2
          do i3 = I3BLOCK, min(I3BLOCK+BS3-1, 257)
            do i2 = I2BLOCK, min(I2BLOCK+BS2-1, 257)
              do i1 = 2, 257
                ! update u(i1,i2,i3)
                ! using 27-point stencil
              end do
            end do
          end do
        end do
      end do

      Block size    Mop/s/process
      unblocked            531.50
      16 x 16              674.76
      22 x 22              680.16
      28 x 28              688.64
      34 x 34              683.84
      40 x 40              698.47
      46 x 46              689.14
      52 x 52              706.62
      58 x 58              692.57
      64 x 64              703.40
      70 x 70              693.87
The C version with ordinary pointers:

    ( 53) void mat_mul_daxpy(double *a, double *b, double *c,
                             int rowa, int cola, int colb)
    ( 54) {
    ( 55)   int i, j, k;           /* loop counters */
    ( 56)   int rowc, colc, rowb;  /* sizes not passed as arguments */
    ( 57)   double con;            /* constant value */
    ( 58)
    ( 59)   rowb = cola;
    ( 60)   rowc = rowa;
    ( 61)   colc = colb;
    ( 62)
    ( 63)   for(i=0;i<rowc;i++) {
    ( 64)     for(k=0;k<cola;k++) {
    ( 65)       con = *(a + i*cola +k);
    ( 66)       for(j=0;j<colc;j++) {
    ( 67)         *(c + i*colc + j) += con * *(b + k*colb + j);
    ( 68)       }
    ( 69)     }
    ( 70)   }
    ( 71) }

C pointers do not carry the same rules as Fortran arrays: the compiler has no way to know whether *a, *b, and *c overlap or are referenced differently elsewhere, so it must assume the worst. The result is a false data dependency:

    mat_mul_daxpy:
      66, Loop not vectorized: data dependency
          Loop not vectorized: data dependency
          Loop unrolled 4 times
The same routine with restricted pointers:

    ( 53) void mat_mul_daxpy(double* restrict a, double* restrict b,
                             double* restrict c, int rowa, int cola, int colb)
    ( 54) {
    ( 55)   int i, j, k;           /* loop counters */
    ( 56)   int rowc, colc, rowb;  /* sizes not passed as arguments */
    ( 57)   double con;            /* constant value */
    ( 58)
    ( 59)   rowb = cola;
    ( 60)   rowc = rowa;
    ( 61)   colc = colb;
    ( 62)
    ( 63)   for(i=0;i<rowc;i++) {
    ( 64)     for(k=0;k<cola;k++) {
    ( 65)       con = *(a + i*cola +k);
    ( 66)       for(j=0;j<colc;j++) {
    ( 67)         *(c + i*colc + j) += con * *(b + k*colb + j);
    ( 68)       }
    ( 69)     }
    ( 70)   }
    ( 71) }

C99 introduces the restrict keyword, which allows the programmer to promise that the memory referenced through a pointer is not referenced through any other pointer. If you declare a restricted pointer and break that promise, the behavior is undefined by the standard.
With restrict, the compiler now vectorizes the loop at line 66:

    66, Generated alternate loop with no peeling - executed if loop count <= 24
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated alternate loop with no peeling and more aligned moves -
          executed if loop count <= 24 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated alternate loop with more aligned moves - executed if
          loop count >= 25 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop

 The same effect can also be achieved with the PGI "safe" pragma or the -Msafeptr compiler option, or the PathScale -OPT:alias option
 GNU malloc library
    malloc, calloc, realloc, and free calls
    Fortran dynamic variables
 Malloc library system calls
    mmap, munmap => used for larger allocations
    brk, sbrk    => increase/decrease the heap
 The malloc library is optimized for low system memory use
    This can result in extra system calls and minor page faults
 Detecting "bad" malloc behavior
    Profile data => "excessive system time"
 Correcting "bad" malloc behavior
    Eliminate mmap use by malloc
    Increase the threshold for releasing heap memory
 Use environment variables to alter malloc (a mallopt() sketch with the same effect follows below)
    MALLOC_MMAP_MAX_ = 0
    MALLOC_TRIM_THRESHOLD_ = 536870912
 Possible downsides
    Heap fragmentation
    The user process may call mmap directly
    The user process may launch other processes
 PGI's -Msmartalloc option does something similar for you at compile time
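The same behavior can also be requested from inside the program through glibc's mallopt() interface, which these environment variables correspond to. A minimal sketch, assuming a glibc-based malloc; the 512 MB trim threshold mirrors the value above.

    #include <malloc.h>   /* glibc mallopt(), M_MMAP_MAX, M_TRIM_THRESHOLD */

    int main(void)
    {
        /* Equivalent to MALLOC_MMAP_MAX_=0: never satisfy malloc() with mmap(). */
        mallopt(M_MMAP_MAX, 0);

        /* Equivalent to MALLOC_TRIM_THRESHOLD_=536870912: keep up to 512 MB of
           freed memory on the heap instead of returning it to the kernel.      */
        mallopt(M_TRIM_THRESHOLD, 512 * 1024 * 1024);

        /* ... rest of the application ... */
        return 0;
    }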
 Google created a replacement "malloc" library
    "Minimal" TCMalloc replaces GNU malloc
 Limited testing indicates TCMalloc is as good as or better than GNU malloc
    Environment variables are not required
 TCMalloc is almost certainly better for allocations made inside OpenMP parallel regions
 There is currently no pre-built tcmalloc for the Cray XT/XE, but some users have successfully built it
 Linux has a "first touch" policy for memory allocation
    *alloc functions don't actually allocate your memory
    Memory gets allocated when it is first "touched"
 Problem: a code can allocate more memory than is available
    Linux assumes there is swap space; we don't have any
    Applications won't fail from over-allocation until the memory is finally touched
 Problem: memory is placed local to the core of the "touching" thread
    Only a problem if thread 0 allocates all the memory for a node
 Solution: always initialize your memory immediately after allocating it
    If you over-allocate, it fails immediately rather than at a strange place in your code
    If every thread touches its own memory, it is allocated on the proper socket
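A minimal OpenMP sketch of the "initialize it where you will use it" advice; the array size is an arbitrary assumption, and the key point is that the initialization loop uses the same static schedule as the later compute loops.

    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        size_t n = 100 * 1024 * 1024;            /* assumed problem size          */
        double *x = malloc(n * sizeof(double));  /* no pages are resident yet     */
        if (!x) return 1;                        /* allocation failure caught now */

        /* First touch: each thread initializes the pages it will later work on,
           so those pages land in the memory attached to that thread's socket.
           Any over-allocation also fails here, not deep inside the run.         */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            x[i] = 0.0;

        /* ... later parallel loops should keep the same schedule(static) so each
           thread continues to work on the pages it touched first ...            */
        free(x);
        return 0;
    }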
This may help both compute and communication.
 Opterons support 4K, 2M, and 1G pages
    We do not support 1G pages
    4K pages are used by default
 2M (huge) pages are more difficult to use, but...
    Your code may run with fewer TLB misses (hence faster)
    The TLB can address more physical memory with 2M pages than with 4K pages
    The Gemini interconnect performs better with 2M pages than with 4K pages
    2M pages use fewer Gemini resources than 4K pages
 Link the hugetlbfs library into your code: -lhugetlbfs
 Set the HUGETLB_MORECORE environment variable in your run script
    Example:  export HUGETLB_MORECORE=yes
 Use the aprun option -m###h to ask for ### MB of huge pages
    Example:  aprun -m500h   (request 500 MB of huge pages as available; use 4K pages thereafter)
    Example:  aprun -m500hs  (request 500 MB of huge pages; if they are not available, terminate the launch)
 Note: if not enough huge pages are available, the cost of filling the remainder with 4K pages may degrade performance
 Short Message Eager Protocol
    The sending rank "pushes" the message to the receiving rank
    Used for messages of MPICH_MAX_SHORT_MSG_SIZE bytes or less
    The sender assumes the receiver can handle the message:
       a matching receive is posted, or
       the receiver has available event queue entries (MPICH_PTL_UNEX_EVENTS) and buffer space (MPICH_UNEX_BUFFER_SIZE) to store the message
 Long Message Rendezvous Protocol
    Messages are "pulled" by the receiving rank
    Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
    The sender sends a small header packet with the information the receiver needs to pull over the data
    Data is sent only after the matching receive is posted by the receiving rank
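One practical consequence: when receives are posted before the matching sends arrive, eager messages land directly in the application buffer instead of the unexpected buffers, which reduces the pressure on MPICH_UNEX_BUFFER_SIZE and MPICH_PTL_UNEX_EVENTS. A minimal sketch of the pre-posting pattern; the neighbor ranks, tag, and counts are placeholders.

    #include <mpi.h>

    void exchange(double *recvbuf, double *sendbuf, int count,
                  int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];
        MPI_Status  stat[2];
        const int tag = 99;                  /* arbitrary tag for this example */

        /* Post the receives first so incoming eager data has a matched buffer. */
        MPI_Irecv(recvbuf,         count, MPI_DOUBLE, left,  tag, comm, &req[0]);
        MPI_Irecv(recvbuf + count, count, MPI_DOUBLE, right, tag, comm, &req[1]);

        /* Now send; small messages go out eagerly, large ones use rendezvous. */
        MPI_Send(sendbuf,         count, MPI_DOUBLE, right, tag, comm);
        MPI_Send(sendbuf + count, count, MPI_DOUBLE, left,  tag, comm);

        MPI_Waitall(2, req, stat);
    }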
[Figure: MPT eager protocol on SeaStar when the receive is pre-posted. Step 1: the MPI_RECV call posts an application match entry (ME) to Portals. Step 2: the sender calls MPI_SEND. Step 3: a Portals DMA PUT delivers the data directly into the application buffer. The unexpected buffers (MPICH_UNEX_BUFFER_SIZE), unexpected event queue (MPICH_PTL_UNEX_EVENTS), and other event queue (MPICH_PTL_OTHER_EVENTS) are shown but not needed when MPI_RECV is posted prior to the MPI_SEND call.]
[Figure: MPT eager protocol on SeaStar when the receive is NOT pre-posted. Data is "pushed" to the receiver (MPICH_MAX_SHORT_MSG_SIZE bytes or less). Step 1: the sender calls MPI_SEND; no Portals match entry exists yet on the receiver. Step 2: a Portals DMA PUT lands the data in the unexpected buffers (MPICH_UNEX_BUFFER_SIZE) and an entry is queued in the unexpected event queue (MPICH_PTL_UNEX_EVENTS). Step 3: the receiver calls MPI_RECV. Step 4: the data is copied (memcpy) from the unexpected buffer into the application buffer.]
[Figure: MPT long-message rendezvous protocol. Step 1: the MPI_SEND call creates a Portals match entry. Step 2: a Portals DMA PUT sends only a small header to the receiver. Step 3: the MPI_RECV call triggers a GET request. Step 4: the receiver issues the GET request matching the sender's ME. Step 5: a Portals DMA transfers the data. Data is not sent until MPI_RECV is issued.]
 The default rank ordering can be changed using the environment variable MPICH_RANK_REORDER_METHOD
 It accepts the following values:
    0: Round-robin placement - sequential ranks are placed on the next node in the list; placement starts over with the first node upon reaching the end of the list
    1: SMP-style placement (the default) - sequential ranks fill up each node before moving to the next
    2: Folded-rank placement - similar to round-robin, except that each pass over the node list is in the opposite direction of the previous pass
    3: Custom ordering - the ordering is specified in a file named MPICH_RANK_ORDER
 When is this useful?
    Point-to-point communication consumes a significant fraction of program time and a load imbalance has been detected
    It has also been shown to help collectives (alltoall) on subcommunicators (GYRO)
    Spreading I/O out across nodes (POP)
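For illustration only: a hypothetical MPICH_RANK_ORDER file for a 4 x 4 rank grid placed 4 ranks per node, grouping each 2 x 2 block of neighboring ranks onto one node. The comma-separated format mirrors the grid_order output shown later; treat the exact layout rules as an assumption and check the intro_mpi man page on your system.

    # hypothetical MPICH_RANK_ORDER: 16 ranks, 4 ranks per node,
    # one 2x2 block of neighboring ranks per node
    0,1,4,5
    2,3,6,7
    8,9,12,13
    10,11,14,15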
 One can also use the CrayPat performance measurement tools to generate a suggested custom ordering
    Available if MPI functions are traced (-g mpi or -O apa)
       pat_build -O apa my_program
       See the Examples section of the pat_build man page
 pat_report options:
    mpi_sm_rank_order
       Uses message data from tracing MPI to generate a suggested MPI rank order. Requires the program to be instrumented using the pat_build -g mpi option.
    mpi_rank_order
       Uses time in user functions, or alternatively any other metric specified by using the -s mro_metric options, to generate a suggested MPI rank order.
 module load xt-craypat
 Rebuild your code
 pat_build -O apa a.out
 Run a.out+pat
 pat_report -Ompi_sm_rank_order a.out+pat+...sdt/ > pat.report
    This creates an MPICH_RANK_ORDER.x file
 Then set the environment variable MPICH_RANK_REORDER_METHOD=3, AND
 Link (or copy) the file MPICH_RANK_ORDER.x to MPICH_RANK_ORDER
 Rerun the code
Table 1: Suggested MPI Rank Order
Eight cores per node: USER Samp per node

    Rank    Max        Max/    Avg        Avg/    Max Node
    Order   USER Samp  SMP     USER Samp  SMP     Ranks
    d           17062   97.6%      16907  100.0%  832,328,820,797,113,478,898,600
    2           17213   98.4%      16907  100.0%  53,202,309,458,565,714,821,970
    0           17282   98.8%      16907  100.0%  53,181,309,437,565,693,821,949
    1           17489  100.0%      16907  100.0%  0,1,2,3,4,5,6,7

 This suggests that
   1. the custom ordering "d" might be the best
   2. folded-rank (2) is next best
   3. round-robin (0) is third best
   4. the default ordering (1) is last
 GYRO 8.0
    B3-GTC problem with 1024 processes
 Run with alternate MPI orderings
    Custom: profiled with -O apa and used the reordering file MPICH_RANK_ORDER.d

      Reorder method          Comm. time
      Default                    11.26 s
      0 - round-robin             6.94 s    <- CrayPat suggestion
      2 - folded-rank             6.68 s       almost right!
      d - custom from apa         8.03 s
 TGYRO 1.0
    Steady-state turbulent transport code using the GYRO, NEO, and TGLF components
 ASTRA test case
    Tested MPI orderings at large scale
    Originally testing weak scaling, but found reordering very useful

      Reorder           TGYRO wall time (min)
      method           20480    40960    81920  PEs
      Default             99      104      105
      Round-robin         66       63       72       <- Huge win!
 Time %  |        Time |  Imb. Time |   Imb.   |       Calls | Experiment=1
         |             |            |   Time % |             | Group
         |             |            |          |             |  Function
         |             |            |          |             |   PE='HIDE'

  100.0% | 1530.892958 |         -- |       -- | 27414118.0  | Total
 |---------------------------------------------------------------------------
 |  52.0% |  796.046937 |         -- |       -- | 22403802.0 | USER
 ||--------------------------------------------------------------------------
 ||  22.3% |  341.176468 |   3.482338 |     1.0% | 19200000.0 | getrates_
 ||  17.4% |  266.542501 |  35.451437 |    11.7% |     1200.0 | rhsf_
 ||   5.1% |   78.772615 |   0.532703 |     0.7% |  3200000.0 | mcavis_new_looptool_
 ||   2.6% |   40.477488 |   2.889609 |     6.7% |     1200.0 | diffflux_proc_looptool_
 ||   2.1% |   31.666938 |   6.785575 |    17.6% |      200.0 | integrate_erk_jstage_lt_
 ||   1.4% |   21.318895 |   5.042270 |    19.1% |     1200.0 | computeheatflux_looptool_
 ||   1.1% |   16.091956 |   6.863891 |    29.9% |        1.0 | main
 ||==========================================================================
 |  47.4% |  725.049709 |         -- |       -- |  5006632.0 | MPI
 ||--------------------------------------------------------------------------
 ||  43.8% |  670.742304 |  83.143600 |    11.0% |  2389440.0 | mpi_wait_
 ||   1.9% |   28.821882 | 281.694997 |    90.7% |  1284320.0 | mpi_isend_
 |============================================================================
 Time %  |        Time |  Imb. Time |   Imb.   |       Calls | Experiment=1
         |             |            |   Time % |             | Group
         |             |            |          |             |  Function
         |             |            |          |             |   PE='HIDE'

  100.0% | 1730.555208 |         -- |       -- | 16090113.8  | Total
 |---------------------------------------------------------------------------
 |  76.9% | 1330.111350 |         -- |       -- |  4882627.8 | MPI
 ||--------------------------------------------------------------------------
 ||  72.1% | 1247.436960 |  54.277263 |     4.2% |  2389440.0 | mpi_wait_
 ||   1.3% |   22.712017 | 101.212360 |    81.7% |  1234718.3 | mpi_isend_
 ||   1.0% |   17.623757 |   4.642004 |    20.9% |        1.0 | mpi_comm_dup_
 ||   1.0% |   16.849281 |  71.805979 |    81.0% |  1234718.3 | mpi_irecv_
 ||   1.0% |   16.835691 | 192.820387 |    92.0% |    19999.2 | mpi_waitall_
 ||==========================================================================
 |  22.2% |  384.978417 |         -- |       -- | 11203802.0 | USER
 ||--------------------------------------------------------------------------
 ||   9.9% |  171.440025 |   1.929439 |     1.1% |  9600000.0 | getrates_
 ||   7.7% |  133.599580 |  19.572807 |    12.8% |     1200.0 | rhsf_
 ||   2.3% |   39.465572 |   0.600168 |     1.5% |  1600000.0 | mcavis_new_looptool_
 |============================================================================
[Figure: nearest-neighbor exchanges for the 3D stencil. Differencing in the X direction communicates with MPI tasks K-1 and K+1; differencing in the Y and Z directions communicates with tasks K-30/K+30 and K-1200/K+1200.]
 The code must perform one communication across each surface of a cube
    12 cubes perform 72 communications, 63 of which go "off node"
 With an optimized mapping of the MPI tasks onto the node, the code still performs 72 communications, but now only 32 are off node
Rank Reordering Case Study
 Application data is in a 3D space, X x Y x Z
 Communication is nearest-neighbor
 The default ordering results in a 12 x 1 x 1 block of ranks on each node
 A custom reordering is generated: 3 x 2 x 2 blocks per node, resulting in more on-node communication
    % pat_report -Ompi_sm_rank_order -s rank_grid_dim=8,6 ...

    Notes for table 1:
      To maximize the locality of point to point communication, specify a
      Rank Order with small Max and Avg Sent Msg Total Bytes per node for
      the target number of cores per node.
      To specify a Rank Order with a numerical value, set the environment
      variable MPICH_RANK_REORDER_METHOD to the given value.
      To specify a Rank Order with a letter value 'x', set the environment
      variable MPICH_RANK_REORDER_METHOD to 3, and copy or link the file
      MPICH_RANK_ORDER.x to MPICH_RANK_ORDER.

    Table 1: Sent Message Stats and Suggested MPI Rank Order

      Communication Partner Counts
      Number of
      Partners   Count   Ranks
             2       4   0 5 42 47
             3      20   1 2 3 4 ...
             4      24   7 8 9 10 ...
Four cores per node: Sent Msg Total Bytes per node

    Rank    Max           Max/     Avg           Avg/     Max Node
    Order   Total Bytes   SMP      Total Bytes   SMP      Ranks
    g         121651200    73.9%      86400000    62.5%   14,20,15,21
    h         121651200    73.9%      86400000    62.5%   14,20,21,15
    u         152064000    92.4%     146534400   106.0%   13,12,10,4
    1         164505600   100.0%     138240000   100.0%   16,17,18,19
    d         164505600   100.0%     142387200   103.0%   16,17,19,18
    0         224640000   136.6%     207360000   150.0%   1,13,25,37
    2         241920000   147.1%     207360000   150.0%   7,16,31,40
    % $CRAYPAT_ROOT/sbin/grid_order -c2,2 -g 8,6
    # grid_order -c 2,2 -g 8,6
    # Region 0: 0,0 (0..47)
    0,1,6,7
    2,3,8,9
    4,5,10,11
    12,13,18,19
    14,15,20,21
    16,17,22,23
    24,25,30,31
    26,27,32,33
    28,29,34,35
    36,37,42,43
    38,39,44,45
    40,41,46,47

This script will also handle the case where the cells do not evenly partition the grid.
      X X o o
      X X o o
      o o o o
      o o o o

 Nodes marked X heavily use a shared resource
    If it is memory bandwidth, scatter the X's
    If it is network bandwidth to the other nodes, again scatter them
    If it is network bandwidth among themselves, concentrate them
    call mpi_send(a, 10, ...)        ! Each message incurs latency and
    call mpi_send(b, 10, ...)        ! library overhead
    call mpi_send(c, 10, ...)
    call mpi_send(d, 10, ...)

 Copy the messages into a contiguous buffer and send once:

    sendbuf( 1:10) = a(1:10)
    sendbuf(11:20) = b(1:10)
    sendbuf(21:30) = c(1:10)
    sendbuf(31:40) = d(1:10)
    call mpi_send(sendbuf, 40, ...)  ! Latency and library overhead
                                     ! incurred only once

 The effectiveness of this optimization is machine dependent
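The same aggregation can also be expressed with MPI_Pack instead of hand-copying, sketched here in C for the four 10-element arrays above; whether packing beats a plain copy into a contiguous buffer is, again, machine dependent.

    #include <mpi.h>

    void send_aggregated(double *a, double *b, double *c, double *d,
                         int dest, MPI_Comm comm)
    {
        char buf[4 * 10 * sizeof(double)];   /* room for the four 10-element arrays */
        int  pos = 0;

        /* Pack the four small messages into one contiguous buffer... */
        MPI_Pack(a, 10, MPI_DOUBLE, buf, sizeof(buf), &pos, comm);
        MPI_Pack(b, 10, MPI_DOUBLE, buf, sizeof(buf), &pos, comm);
        MPI_Pack(c, 10, MPI_DOUBLE, buf, sizeof(buf), &pos, comm);
        MPI_Pack(d, 10, MPI_DOUBLE, buf, sizeof(buf), &pos, comm);

        /* ...and pay the latency and library overhead only once. */
        MPI_Send(buf, pos, MPI_PACKED, dest, 0, comm);
    }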
 Most collectives have been tuned to take advantage of algorithms and hardware to maximize performance
    MPI_ALLTOALL
       Reorders communications to spread traffic around the network efficiently
    MPI_BCAST / _REDUCE / _ALLREDUCE
       Use tree-based algorithms to reduce the number of messages
       Need to strike a balance between the width and depth of the tree
    MPI_GATHER
       Uses a tree algorithm to reduce resource contention and to aggregate messages
 You don't want to have to reinvent the wheel
 MPI_ALLTOALL
    Message size decreases as the number of ranks grows
    The number of messages is O(num_ranks^2)
    Very difficult to scale to very high core counts
 MPI_BCAST / _REDUCE / _ALLREDUCE / _BARRIER
    All are O(log(num_ranks))
    All represent global synchronization points
    They expose ANY load imbalance in the code
    They expose ANY "jitter" induced by the OS or other services
 MPI_GATHER
    Many-to-one
 The greater the frequency of collectives, the harder it will be to scale
 Filesystem
    Lustre, GPFS, and Panasas are "parallel filesystems"
    I/O operations are broken down into basic units and distributed to multiple endpoints
    Spreading out operations in this way can greatly improve performance at large processor counts
 Program
    Just as a problem gets partitioned across multiple processors, I/O operations can be done in parallel
    MPI-IO is a standard API for doing parallel I/O operations
    By performing I/O operations in parallel, an application can reduce I/O bottlenecks and take advantage of parallel filesystems
    HDF5, NetCDF, and ADIOS all provide parallel I/O in a portable file format
 To maximize I/O performance, parallel filesystems
    Break I/O operations into chunks, much like inodes on standard filesystems, which get distributed among I/O servers
    Provide a means of controlling how much concurrency to use for a given file
    Make the distributed nature of the data invisible to the program/programmer
    File metadata may be distributed (GPFS) or centralized (Lustre)
 In order to take advantage of a parallel filesystem, a user must
    Ensure that multiple processes share the I/O duties; one process is incapable of saturating the filesystem
    Prevent multiple processes from using the same "chunk" simultaneously (more important for writes)
    Choose a concurrency that is "distributed enough" without spreading the data too thin to be effective (ideally, one process shouldn't need to access several I/O servers)
 I/O is simply data migration: memory <-> disk
 I/O is a very expensive operation
    Interactions with data in memory and on disk
    Must get the kernel involved
 How is I/O performed?
    I/O pattern
       Number of processes and files
       File access characteristics
 Where is I/O performed?
    Characteristics of the computational system
    Characteristics of the file system
 There is no "one size fits all" solution to the I/O problem
    Many I/O patterns work well for some range of parameters
    Bottlenecks in performance can occur in many locations (application and/or file system)
    Going to extremes with an I/O pattern will typically lead to problems
 The best performance comes from situations where the data is accessed contiguously both in memory and on disk
    This facilitates large operations and minimizes latency
 Commonly, data access is contiguous in memory but noncontiguous on disk, or vice versa, usually in order to reconstruct a global data structure via parallel I/O

   [Figure: a contiguous memory-to-disk mapping versus a noncontiguous mapping between memory and disk.]
 Spokesperson pattern
    One process performs all I/O
    Requires data aggregation or duplication
    Limited by the single I/O process
    The pattern does not scale
       Time increases linearly with the amount of data
       Time increases with the number of processes
 File per process
    All processes perform I/O to individual files
    Limited by the file system
    The pattern does not scale at large process counts
       The number of files creates a bottleneck with metadata operations
       The number of simultaneous disk accesses creates contention for file system resources
 Shared file
    Each process performs I/O to a single file which is shared
    Performance
       The data layout within the shared file is very important
       At large process counts, contention can build for file system resources
 Subset of processes which perform I/O
    Aggregates the data of a group of processes
    Serializes I/O within the group
 The I/O processes may access independent files
    Limits the number of files accessed
 Or a group of processes performs parallel I/O to a shared file
    Increases the number of shared files, to increase file system usage
    Decreases the number of processes which access a shared file, to decrease file system contention
 File-per-process write performance: 128 MB per file and a 32 MB transfer size

   [Figure: write bandwidth in MB/s (0-12000) versus number of processes/files (0-9000), comparing a 1 MB stripe with a 32 MB stripe.]
 Single shared file write performance: 32 MB per process, with a 32 MB transfer size and stripe size

   [Figure: write bandwidth in MB/s (0-8000) versus number of processes (0-9000), comparing POSIX, MPI-IO, and HDF5 interfaces.]
 Lustre
    Minimize contention for file system resources
    A process should not access more than one or two OSTs
 Performance
    Performance is limited for single-process I/O
    Parallel I/O utilizing a file-per-process or a single shared file is limited at large scales
    A potential solution is to utilize multiple shared files, or a subset of processes which perform I/O
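In practice this usually comes down to choosing Lustre stripe settings per directory before the job writes there. A sketch using the lfs command; the counts are illustrative, and option spellings vary between Lustre versions, so check lfs help setstripe on your system.

    % lfs setstripe -c 1  per_process_output/    # small per-process files: one OST each
    % lfs setstripe -c -1 shared_file_output/    # one large shared file: stripe across all OSTs
    % lfs getstripe shared_file_output/          # verify the layout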
 The standard output and error streams are effectively serial I/O
    All STDIN, STDOUT, and STDERR I/O serializes through aprun
 Disable debugging messages when running in production mode
    "Hello, I'm task 32000!"
    "Task 64000, made it through Lustre loop."
 Advantages of buffered I/O
    Aggregates smaller read/write operations into larger operations
    Examples: OS kernel buffer, MPI-IO collective buffering
 Disadvantages
    Requires additional memory for the buffer
    Can tend to serialize I/O
 Caution
    Frequent buffer flushes can adversely affect performance
 If an application does extremely small, irregular I/O, explicit buffering may improve performance
 Case study (an extreme example): a post-processing application writes a 1 GB file
    The writing is done by one writer, but in many small write operations
    It takes 1080 s (~18 minutes) to complete
    IOBUF was used to intercept these writes with 64 MB buffers
    It then takes 4.5 s to complete: a 99.6% reduction in time

    Lustre File "ssef_cn_2008052600f000"
               Calls    Seconds      Megabytes  Megabytes/sec   Avg Size
    Open           1   0.001119
    Read         217   0.247026       0.105957       0.428931        512
    Write    2083634   1.453222    1017.398927     700.098632        512
    Close          1   0.220755
    Total    2083853   1.922122    1017.504884     529.365466        512
    Sys Read       6   0.655251     384.000000     586.035160   67108864
    Sys Write     17   3.848807    1081.145508     280.904052   66686072
    Buffers used 4 (256 MB)   Prefetches 6   Preflushes 15
 Writing a big-endian binary file with the compiler flag -byteswapio:

    File "XXXXXX"
               Calls      Megabytes     Avg Size
    Open           1
    Write    5918150    23071.28062         4088
    Close          1
    Total    5918152    23071.28062         4088

 Writing a little-endian binary file:

    File "XXXXXX"
               Calls      Megabytes     Avg Size
    Open           1
    Write        350    23071.28062     69120000
    Close          1
    Total        352    23071.28062     69120000
 MPI-IO allows multiple MPI processes to access the same file in a distributed manner
    As with other MPI operations, it is necessary to provide a datatype for the items being written to the file (which may be a derived type)
 There are three ways to declare the file position:
    Explicit offset: each operation explicitly declares the necessary file offset
    Individual file pointers: each process has its own unique handle into the file
    Shared file pointers: the MPI library maintains one file pointer and determines how to handle parallel access (often via serialization)
 For each file-position type there are two coordination patterns:
    Non-collective: each process acts on its own behalf
    Collective: the processes coordinate, possibly allowing the library to make smart decisions about how to access the filesystem
 MPI-IO allows the user to provide "hints" to improve I/O performance; often performance can be improved via hints about the filesystem or problem-specific details
    int mode, ierr;
    char tmps[24];
    MPI_File fh;
    MPI_Info info;
    MPI_Status status;

    /* Open a file across all ranks as read/write.
       Hints can be set between MPI_Info_create and MPI_File_open. */
    mode = MPI_MODE_CREATE | MPI_MODE_RDWR;
    MPI_Info_create(&info);
    MPI_File_open(comm, "output/test.dat", mode, info, &fh);

    /* Set the "view" (offset) for each rank. */
    MPI_File_set_view(fh, commrank*iosize, MPI_DOUBLE, MPI_DOUBLE,
                      "native", info);

    /* Collectively write from all ranks. */
    MPI_File_write_all(fh, dbuf, iosize/sizeof(double), MPI_DOUBLE, &status);

    /* Close the file from all ranks. */
    MPI_File_close(&fh);
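For example, hints could be supplied between MPI_Info_create and MPI_File_open as below. The hint names are standard ROMIO/Cray MPI-IO hints (also discussed on the following slides); the values are illustrative assumptions, not recommendations.

    /* Illustrative hints, set before MPI_File_open(); values are assumptions. */
    MPI_Info_set(info, "striping_factor", "16");      /* Lustre stripe count  */
    MPI_Info_set(info, "striping_unit",   "1048576"); /* 1 MB stripe size     */
    MPI_Info_set(info, "romio_cb_write",  "enable");  /* collective buffering */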
 Several parallel libraries are available that provide a portable, metadata-rich file format
    On Cray machines it is possible to set MPI-IO hints in your environment to improve out-of-the-box performance
 HDF5 (http://www.hdfgroup.org/HDF5/)
    Has long supported parallel file access
    Currently at version 1.8
 NetCDF (http://www.unidata.ucar.edu/software/netcdf/)
    Multiple parallel implementations of NetCDF exist
    Beginning with version 4.0, HDF5 is used under the hood to provide native support for parallel file access
    Currently at version 4.0
 ADIOS (http://adiosapi.org)
    Fairly young library in development by ORNL, Georgia Tech, and others
    Has a native file format, but also supports POSIX, NetCDF, HDF5, and other file formats
    Version 1.0 was released at SC09
 Parallel filesystems
    Minimize contention for file system resources
    A process should not access more than one or two OSTs
    Ideally, I/O buffer sizes and filesystem "chunk" sizes should match evenly to avoid locking
 Performance
    Performance is limited for single-process I/O
    Parallel I/O utilizing a file-per-process or a single shared file is limited at large scales
    A potential solution is to utilize multiple shared files, or a subset of processes which perform I/O
    Large buffers will generally perform best
 Load the IOBUF module:
      % module load iobuf
 Relink the program
 Set the IOBUF_PARAMS environment variable as needed:
      % setenv IOBUF_PARAMS '*:verbose'
 Execute the program
 IOBUF has a large number of options for tuning behavior from file to file; see man iobuf for details
 It may significantly help codes that write a lot to stdout or stderr
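As a sketch of per-file tuning, something like the following can restrict buffering to a file pattern while keeping verbose reporting everywhere else. The size and count keywords shown here are assumptions; consult man iobuf for the exact keyword names supported on your system.

    % setenv IOBUF_PARAMS '*.dat:size=16M:count=4,*:verbose'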
 A particular code both reads and writes a 377 GB file and runs on 6000 cores
    Total I/O volume (reads and writes) is 850 GB
    Utilizes parallel HDF5
 Default stripe settings: count 4, size 1M, index -1
    1800 s run time (~30 minutes)
 Stripe settings: count -1, size 1M, index -1
    625 s run time (~10 minutes)
 Result: a 66% decrease in run time
 Included in the Cray MPT library
 Environment variables used to help MPI-IO optimize I/O performance:
    MPICH_MPIIO_CB_ALIGN (default 2)
    MPICH_MPIIO_HINTS
       Can set striping_factor and striping_unit for files created with MPI-IO
       If writes and/or reads utilize collective calls, collective buffering can be utilized (romio_cb_read/write) to approximately stripe-align I/O within Lustre
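A hedged example of setting these from the environment rather than in code; the pattern:key=value syntax follows the MPI man page, and the file pattern and values are illustrative assumptions only.

    % export MPICH_MPIIO_CB_ALIGN=2
    % export MPICH_MPIIO_HINTS='*.h5:striping_factor=16:striping_unit=1048576:romio_cb_write=enable'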
 MPI-IO API, non-power-of-2 blocks and transfers; in this case blocks and transfers are both 1 MB, with a strided access pattern
 Tested on an XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220 segments, 96 GB file

   [Figure: bar chart of achieved bandwidth, 0-1800 MB/s.]
 MPI-IO API, non-power-of-2 blocks and transfers; in this case blocks and transfers are both 10 KB, with a strided access pattern
 Tested on an XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220 segments, 96 GB file

   [Figure: bar chart of achieved bandwidth, 0-160 MB/s.]
 On 5107 PEs, and by application design, a subset of the PEs (88) do the writes. With collective buffering, this is further reduced to 22 aggregators (cb_nodes) writing to 22 stripes
 Tested on an XT5 with 5107 PEs, 8 cores/node

   [Figure: bar chart of achieved bandwidth, 0-4000 MB/s.]
 Total file size is 6.4 GiB: a mesh of 64M bytes, 32M elements, with the work divided amongst all PEs
 The original problem scaled very poorly; for example, without collective buffering, 8000 PEs take over 5 minutes to dump
 Note that disabling data sieving was necessary
 Tested on an XT5 with 8 stripes, 8 cb_nodes

   [Figure: dump time in seconds (log scale, 1-1000) versus PE count, for runs without collective buffering and with CB=0, CB=1, CB=2.]
 Do not open a lot of files all at once (metadata bottleneck)
 Use a simple ls (without color) instead of ls -l (OST bottleneck)
 Remember to stripe files
    Small, individual files => small stripe counts
    Large, shared files     => large stripe counts
 Never set an explicit starting OST for your files (filesystem balance)
 Open files as read-only when possible
 Limit the number of files per directory
 Stat files from just one process
 Stripe-align your I/O (reduces locks)
 Read small, shared files once and broadcast the data (reduces OST contention; sketched below)
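The last point, sketched in C: rank 0 reads a small shared file once and broadcasts it, so thousands of ranks do not hit the same OST with the same read. The file name and size limit are placeholders for this example.

    #include <mpi.h>
    #include <stdio.h>

    #define CFG_MAX 4096                       /* assumed upper bound on file size */

    /* Rank 0 reads the small shared file once; everyone else receives it via
       MPI_Bcast.  'buf' must hold at least CFG_MAX bytes on every rank.        */
    int read_config(char *buf, MPI_Comm comm)
    {
        int rank, len = 0;
        MPI_Comm_rank(comm, &rank);

        if (rank == 0) {
            FILE *fp = fopen("input.cfg", "r");    /* placeholder file name */
            if (fp) {
                len = (int)fread(buf, 1, CFG_MAX, fp);
                fclose(fp);
            }
        }
        MPI_Bcast(&len, 1, MPI_INT, 0, comm);      /* share the byte count  */
        MPI_Bcast(buf, len, MPI_CHAR, 0, comm);    /* share the contents    */
        return len;
    }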
 Adaptable IO System (ADIOS)
    http://www.olcf.ornl.gov/center-projects/adios/
 "Optimizing MPI-IO for Applications on Cray XT Systems" (CrayDoc S-0013-10)
 "A Pragmatic Approach to Improving the Large-scale Parallel I/O Performance of Scientific Applications." Crosby, et al. (CUG 2011)