 Review of XT6 Architecture
    AMD Opteron
    Cray Networks
    Lustre Basics
 Programming Environment
    PGI Compiler Basics
    The Cray Compiler Environment
    Cray Scientific Libraries
    Cray Message Passing Toolkit
 Cray Performance Analysis Tools
 ATP
 CCM
 Optimizations
    CPU
    Communication
    I/O
AMD CPU Architecture
   Cray Architecture
Lustre Filesystem Basics




AMD Opteron roadmap:

Year   Codename        Mfg. Process   CPU Core     L2/L3 Cache   HyperTransport   Memory
2003   AMD Opteron     130nm SOI      K8           1MB/0         3x 1.6GT/s       2x DDR1 300
2005   AMD Opteron     90nm SOI       K8           1MB/0         3x 1.6GT/s       2x DDR1 400
2007   "Barcelona"     65nm SOI       Greyhound    512kB/2MB     3x 2GT/s         2x DDR2 667
2008   "Shanghai"      45nm SOI       Greyhound+   512kB/6MB     3x 4.0GT/s       2x DDR2 800
2009   "Istanbul"      45nm SOI       Greyhound+   512kB/6MB     3x 4.8GT/s       2x DDR2 800
2010   "Magny-Cours"   45nm SOI       Greyhound+   512kB/12MB    4x 6.4GT/s       4x DDR3 1333
AMD Magny-Cours characteristics:
  12-core part:  1.7-2.2 GHz, 105.6 Gflops
  8-core part:   1.8-2.4 GHz, 76.8 Gflops
  Power (ACP):   80 Watts
  Stream:        27.5 GB/s
  Cache:         12x 64KB L1, 12x 512KB L2, 12MB L3
[Figure: Magny-Cours die layout — each core (Core 0 through Core 11) has a private L2 cache; each six-core die shares an L3 cache, a memory controller, and HyperTransport links]
 A cache line is 64B
 Unique L1 and L2 cache attached to each core
    L1 cache is 64 kbytes
    L2 cache is 512 kbytes
 L3 Cache is shared between 6 cores
    Cache is a “victim cache”
    All loads go to L1 immediately and get evicted down the caches
 Hardware prefetcher detects forward and backward strides through
  memory
 Each core can perform a 128b add and 128b multiply per clock cycle
    This requires SSE, packed instructions
    “Stride-one vectorization”
 6 cores share a “flat” memory
    Non-uniform-memory-access (NUMA) beyond a node
Processor        Frequency (GHz)   Peak (Gflops)   Bandwidth (GB/sec)   Balance (bytes/flop)
Istanbul (XT5)   2.6               62.4            12.8                 0.21
MC-8             2.0               64.0            42.6                 0.67
MC-8             2.3               73.6            42.6                 0.58
MC-8             2.4               76.8            42.6                 0.55
MC-12            1.9               91.2            42.6                 0.47
MC-12            2.1               100.8           42.6                 0.42
MC-12            2.2               105.6           42.6                 0.40
Gemini (XE-series)




 Microkernel on Compute PEs, full-featured Linux on Service PEs
 Service PEs specialize by function: Login, Network, System, and I/O PEs
 Service partition: specialized Linux nodes
 Software architecture eliminates OS "jitter"
 Software architecture enables reproducible run times
 Large machines boot in under 30 minutes, including the filesystem
XE6 System
[Figure: XE6 system diagram — external login server, boot RAID, 10 GbE and IB QDR connections]
Node characteristics:
  Number of Cores:                      16 or 24 (MC), 32 (IL)
  Peak Performance, MC-8 (2.4 GHz):     153 Gflops/sec
  Peak Performance, MC-12 (2.2 GHz):    211 Gflops/sec
  Memory Size:                          32 or 64 GB per node
  Memory Bandwidth:                     83.5 GB/sec
[Figure: node diagram — 6.4 GB/sec direct-connect HyperTransport, 83.5 GB/sec direct-connect memory, Cray SeaStar2+ interconnect]
[Figure: two-socket Magny-Cours node — four 6-core "Greyhound" dies, each with a 6MB L3 cache and two DDR3 channels, fully connected with HT3 links; HT1/HT3 link to the interconnect]
 2 Multi-Chip Modules, 4 Opteron dies
 8 channels of DDR3 bandwidth to 8 DIMMs
 24 (or 16) computational cores, 24 MB of L3 cache
 Dies are fully connected with HT3
 Snoop filter feature allows the 4-die SMP to scale well
Without the snoop filter, a STREAM test shows 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth.
With the snoop filter, a STREAM test shows 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth.

This feature will be key for two-socket Magny-Cours nodes, which have the same architecture.
 New compute blade with 8 AMD
  Magny Cours processors
 Plug-compatible with XT5 cabinets
  and backplanes
 Upgradeable to AMD’s
  “Interlagos” series
 XE6 systems ship with the current
  SIO blade




 Supports 2 nodes per ASIC
 168 GB/sec routing capacity
 Scales to over 100,000 network endpoints
 Link-level reliability and adaptive routing
 Advanced resiliency features
 Provides global address space
 Advanced NIC designed to efficiently support
    MPI
    One-sided MPI
    Shmem
    UPC, Coarray Fortran
[Figure: Gemini ASIC block diagram — two HyperTransport 3 interfaces, NIC 0 and NIC 1, Netlink block, LO processor, and a 48-port YARC router]
Cray Baker node characteristics:
  Number of Cores:     16 or 24
  Peak Performance:    140 or 210 Gflops/s
  Memory Size:         32 or 64 GB per node
  Memory Bandwidth:    85 GB/sec
[Figure: 10 12X Gemini channels; each Gemini acts like two nodes on the 3D torus; high-radix YARC router with adaptive routing, 168 GB/sec capacity]
[Figure: blade module with SeaStar vs. blade module with Gemini on the 3D (X, Y, Z) torus]
[Figure: Gemini NIC block diagram — HT3 cave, FMA, BTE, NPT, AMO, CQ, NAT, RMT, and RAT blocks feeding network requests and responses into the router tiles]
 FMA (Fast Memory Access)
    Mechanism for most MPI transfers
    Supports tens of millions of MPI requests per second
 BTE (Block Transfer Engine)
    Supports asynchronous block transfers between local and remote memory, in either direction
    Used for large MPI transfers that happen in the background
 Two Gemini ASICs are
    packaged on a pin-compatible
    mezzanine card
   Topology is a 3-D torus
   Each lane of the torus is
    composed of 4 Gemini router
    “tiles”
   Systems with SeaStar
    interconnects can be upgraded
    by swapping this card
   100% of the 48 router tiles on
    each Gemini chip are used



Name       Architecture   Processor                      Network       # Cores   Memory/Core
Jade       XT4            AMD Budapest (2.1 GHz)         SeaStar 2.1   8584      2GB DDR2-800
Einstein   XT5            AMD Shanghai (2.4 GHz)         SeaStar 2.1   12827     2GB DDR2-800 (some nodes have 4GB/core)
MRAP       XT5            AMD Barcelona (2.3 GHz)        SeaStar 2.1   10400     4GB DDR2-800
Garnet     XE6            Magny-Cours 8-core (2.4 GHz)   Gemini 1.0    20160     2GB DDR3-1333
Raptor     XE6            Magny-Cours 8-core (2.4 GHz)   Gemini 1.0    43712     2GB DDR3-1333
Chugach    XE6            Magny-Cours 8-core (2.3 GHz)   Gemini 1.0    11648     2GB DDR3-1333
[Figure: cabinet airflow — alternating low-velocity and high-velocity airflow regions]
 Cool air is released into the computer room
 The hot air stream passes through the evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation)
 R134a absorbs energy only in the presence of heated air
 Phase change is 10x more efficient than pure water cooling
[Figure: evaporator — liquid in, liquid/vapor mixture out]
[Figure: R134a piping, inlet evaporator, and exit evaporators]
Term           Meaning                   Purpose
MDS            Metadata Server           Manages all file metadata for the filesystem. 1 per FS.
OST            Object Storage Target     The basic "chunk" of data written to disk. Max 160 per file.
OSS            Object Storage Server     Communicates with disks, manages 1 or more OSTs. 1 or more per FS.
Stripe Size    Size of chunks            Controls the size of file chunks stored to OSTs. Can't be changed once the file is written.
Stripe Count   Number of OSTs per file   Controls the parallelism of the file. Can't be changed once the file is written.
 32 MB per OST (32 MB – 5 GB) and 32 MB transfer size
    Unable to take advantage of file system parallelism
    Access to multiple disks adds overhead which hurts performance
[Figure: "Single Writer Write Performance" — Lustre write bandwidth (MB/s) vs. stripe count (1 to 160) for 1 MB and 32 MB stripe sizes]
 Single OST, 256 MB file size
    Performance can be limited by the process (transfer size) or the file system (stripe size)
[Figure: "Single Writer Transfer vs. Stripe Size" — Lustre write bandwidth (MB/s) vs. stripe size (1 to 128 MB) for 32 MB, 8 MB, and 1 MB transfer sizes]
 Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and
  possibly stripe size (a Fortran MPI-IO sketch follows below)
    lfs setstripe -c -1 -s 4M <file or directory> (-1 = all available OSTs, max 160; 4MB stripe)
    lfs setstripe -c 1 -s 16M <file or directory> (1 OST, 16M stripe)
    export MPICH_MPIIO_HINTS='*: striping_factor=160'
 Files inherit striping information from the parent directory; this cannot be
  changed once the file is written
    Set the striping before copying in files
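For illustration, a minimal Fortran sketch of passing the same kind of striping settings through an MPI_Info object at file-open time. The file name and hint values are placeholders; striping_factor and striping_unit are the standard ROMIO hint names (striping_unit, the stripe size in bytes, is assumed here in addition to the striping_factor shown above).

! Sketch only: set Lustre striping through MPI-IO hints at open time.
program set_stripe_hints
  use mpi
  implicit none
  integer :: ierr, info, fh

  call MPI_Init(ierr)
  call MPI_Info_create(info, ierr)
  call MPI_Info_set(info, "striping_factor", "16", ierr)       ! stripe count (placeholder)
  call MPI_Info_set(info, "striping_unit", "4194304", ierr)    ! 4 MB stripe size (placeholder)
  call MPI_File_open(MPI_COMM_WORLD, "output.dat", &
                     ior(MPI_MODE_CREATE, MPI_MODE_WRONLY), info, fh, ierr)
  ! ... collective writes (e.g. MPI_File_write_all) would go here ...
  call MPI_File_close(fh, ierr)
  call MPI_Info_free(info, ierr)
  call MPI_Finalize(ierr)
end program set_stripe_hints

The hints only take effect on a newly created file, which is the same restriction as the lfs settings above.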




Available Compilers
   Cray Scientific Libraries
Cray Message Passing Toolkit




 Cray XT/XE supercomputers come with compiler wrappers to simplify
  building parallel applications (similar to mpicc/mpif90); a minimal example follows below
    Fortran Compiler: ftn
    C Compiler: cc
    C++ Compiler: CC
 Using these wrappers ensures that your code is built for the compute
  nodes and linked against important libraries
    Cray MPT (MPI, Shmem, etc.)
    Cray LibSci (BLAS, LAPACK, etc.)
    …
 Choose the underlying compiler via the PrgEnv-* modules; do not call
  the PGI, Cray, etc. compilers directly.
 Always load the appropriate xtpe-<arch> module for your machine
    Enables proper compiler target
    Links optimized math libraries
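As a quick illustration (file and program names are arbitrary), a trivial MPI program built with the ftn wrapper; no MPI include paths or libraries are specified because the wrapper adds them, and the PrgEnv-* module determines which compiler actually runs.

! build: ftn -o hello hello.f90
! run:   aprun -n 4 ./hello
program hello
  use mpi
  implicit none
  integer :: ierr, rank, nranks
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  print '(a,i0,a,i0)', 'Hello from rank ', rank, ' of ', nranks
  call MPI_Finalize(ierr)
end program hello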


…from Cray’s Perspective

 PGI – Very good Fortran and C, pretty good C++
     Good vectorization
     Good functional correctness with optimization enabled
     Good manual and automatic prefetch capabilities
     Very interested in the Linux HPC market, although that is not their only focus
     Excellent working relationship with Cray, good bug responsiveness
 Pathscale – Good Fortran, C, possibly good C++
     Outstanding scalar optimization for loops that do not vectorize
     Fortran front end uses an older version of the CCE Fortran front end
     OpenMP uses a non-pthreads approach
     Scalar benefits will not get as much mileage with longer vectors
 Intel – Good Fortran, excellent C and C++ (if you ignore vectorization)
     Automatic vectorization capabilities are modest, compared to PGI and CCE
     Use of inline assembly is encouraged
     Focus is more on best speed for scalar, non-scaling apps
     Tuned for Intel architectures, but actually works well for some applications on
       AMD
…from Cray’s Perspective

 GNU so-so Fortran, outstanding C and C++ (if you ignore vectorization)
    Obviously, the best for gcc compatibility
    Scalar optimizer was recently rewritten and is very good
    Vectorization capabilities focus mostly on inline assembly
    Note the last three releases have been incompatible with each other (4.3, 4.4,
      and 4.5) and required recompilation of Fortran modules
 CCE – Outstanding Fortran, very good C, and okay C++
    Very good vectorization
    Very good Fortran language support; only real choice for Coarrays
    C support is quite good, with UPC support
    Very good scalar optimization and automatic parallelization
    Clean implementation of OpenMP 3.0, with tasks
    Sole delivery focus is on Linux-based Cray hardware systems
    Best bug turnaround time (if it isn’t, let us know!)
    Cleanest integration with other Cray tools (performance tools, debuggers,
     upcoming productivity tools)
    No inline assembly support

 PGI
       -fast -Mipa=fast(,safe)
       If you can be flexible with precision, also try -Mfprelaxed
       Compiler feedback: -Minfo=all -Mneginfo
       man pgf90; man pgcc; man pgCC; or pgf90 -help
 Cray
    <none, turned on by default>
    Compiler feedback: -rm (Fortran) -hlist=m (C)
    If you know you don’t want OpenMP: -xomp or -Othread0
    man crayftn; man craycc ; man crayCC
 Pathscale
    -Ofast Note: this is a little looser with precision than other compilers
    Compiler feedback: -LNO:simd_verbose=ON
    man eko (“Every Known Optimization”)
 GNU
    -O2 / -O3
    Compiler feedback: good luck
    man gfortran; man gcc; man g++
 Intel
    -fast
    Compiler feedback:
    man ifort; man icc; man iCC

 Traditional (scalar) optimizations are controlled via -O# compiler flags
    Default: -O2
 More aggressive optimizations (including vectorization) are enabled with
  the -fast or -fastsse metaflags
    These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre
      -Mautoinline -Mvect=sse -Mscalarsse
      -Mcache_align -Mflushz -Mpre
 Interprocedural analysis allows the compiler to perform whole-program
  optimizations. This is enabled with -Mipa=fast
 See man pgf90, man pgcc, or man pgCC for more information about
  compiler options.




 Compiler feedback is enabled with -Minfo and -Mneginfo
     This can provide valuable information about what optimizations were
         or were not done and why.
   To debug an optimized code, the -gopt flag will insert debugging
    information without disabling optimizations
   It’s possible to disable optimizations included with -fast if you believe one
    is causing problems
      For example: -fast -Mnolre enables -fast and then disables loop-carried
         redundancy elimination
   To get more information about any compiler flag, add -help with the
    flag in question
      pgf90 -help -fast will give more information about the -fast
         flag
   OpenMP is enabled with the -mp flag



Some compiler options may affect both performance and accuracy. Lower
accuracy often means higher performance, but these flags also make it possible to enforce strict accuracy.

 -Kieee: All FP math strictly conforms to IEEE 754 (off by default)
 -Ktrap: Turns on processor trapping of FP exceptions
 -Mdaz: Treat all denormalized numbers as zero
 -Mflushz: Set SSE to flush-to-zero (on with -fast)
 -Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to
  speed up some floating point optimizations
    Some other compilers turn this on by default; PGI chooses to favor
     accuracy over speed by default.




 Cray has a long tradition of high performance compilers on Cray
  platforms (Traditional vector, T3E, X1, X2)
    Vectorization
    Parallelization
    Code transformation
    More…
 Investigated leveraging an open source compiler called LLVM


 First release December 2008




[Figure: CCE structure — Fortran source and C/C++ source enter the Fortran front end and the C & C++ front end (the C and C++ front end is supplied by Edison Design Group, with Cray-developed code for extensions and interface support); both feed a common interprocedural analysis and optimization/parallelization stage (Cray Inc. compiler technology), followed by the X86 and Cray X2 code generators that produce the object file; X86 code generation comes from open-source LLVM with additional Cray-developed optimizations and interface support]
 Standard conforming languages and programming models
    Fortran 2003
    UPC & CoArray Fortran
       Fully optimized and integrated into the compiler
       No preprocessor involved
       Target the network appropriately:
          GASNet with Portals
          DMAPP with Gemini & Aries

 Ability and motivation to provide high-quality support for custom
  Cray network hardware
 Cray technology focused on scientific applications
    Takes advantage of Cray’s extensive knowledge of automatic
     vectorization
    Takes advantage of Cray’s extensive knowledge of automatic
     shared memory parallelization
    Supplements, rather than replaces, the available compiler
     choices
 Make sure it is available
    module avail PrgEnv-cray
 To access the Cray compiler
    module load PrgEnv-cray
 To target the various chip types
    module load xtpe-[barcelona,shanghai,mc8]
 Once you have loaded the module, "cc" and "ftn" are the Cray
  compilers
    Recommend just using the default options
    Use -rm (Fortran) and -hlist=m (C) to find out what happened
 man crayftn
 Excellent Vectorization
    Vectorize more loops than other compilers
 OpenMP 3.0
   Task and Nesting
 PGAS: Functional UPC and CAF available today
 C++ Support
 Automatic Parallelization
    Modernized version of Cray X1 streaming capability
    Interacts with OMP directives
 Cache optimizations
    Automatic Blocking
    Automatic Management of what stays in cache
 Prefetching, Interchange, Fusion, and much more…


 Loop Based Optimizations
    Vectorization
    OpenMP
       Autothreading
   Interchange
   Pattern Matching
   Cache blocking/ non-temporal / prefetching
 Fortran 2003 Standard; working on 2008
 PGAS (UPC and Co-Array Fortran)
    Some performance optimizations available in 7.1
 Optimization Feedback: Loopmark
 Focus

 Cray compiler supports a full and growing set of directives
  and pragmas (a short example follows the list below)

!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable


                              man directives
                              man loop_info
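A small illustrative sketch (the loop and trip count are made up) of how these directives sit in Fortran source:

subroutine scale_add(n, a, b, c)
  integer, intent(in) :: n
  real, intent(in)    :: a(n), b(n)
  real, intent(out)   :: c(n)
  integer :: i
!dir$ ivdep                       ! assert no loop-carried dependence
!dir$ loop_info max_trips(512)    ! hint the expected trip count to the compiler
  do i = 1, n
     c(i) = a(i) + 2.0*b(i)
  end do
end subroutine scale_add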
 Compiler can generate a filename.lst file.
    Contains an annotated listing of your source code, with letters indicating important
     optimizations

%%% Loopmark Legend %%%
 Primary Loop Type        Modifiers
 -----------------        ---------
                          a - vector atomic memory operation
 A - Pattern matched      b - blocked
 C - Collapsed            f - fused
 D - Deleted              i - interchanged
 E - Cloned               m - streamed but not partitioned
 I - Inlined              p - conditional, partial and/or computed
 M - Multithreaded        r - unrolled
 P - Parallel/Tasked      s - shortloop
 V - Vectorized           t - array syntax temp used
 W - Unwound              w - unwound
• ftn -rm …      or cc -hlist=m …
29. b-------<   do i3=2,n3-1
30. b b-----<      do i2=2,n2-1
31. b b Vr--<        do i1=1,n1
32. b b Vr            u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr      >           + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr            u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr      >           + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr-->        enddo
37. b b Vr--<        do i1=2,n1-1
38. b b Vr            r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr      >              - a(0) * u(i1,i2,i3)
40. b b Vr      >              - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr      >              - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr-->        enddo
43. b b----->      enddo
44. b------->    enddo

ftn-6289 ftn: VECTOR File = resid.f, Line = 29
 A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines
   32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
 A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
 A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32
   and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
 A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
 A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
 A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
 A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
 A loop starting at line 37 was vectorized.
 -hbyteswapio
   Link-time option
   Applies to all unformatted Fortran I/O
 Assign command
   With the PrgEnv-cray module loaded do this:
setenv FILENV assign.txt
assign -N swap_endian g:su
assign -N swap_endian g:du


 Can use assign to be more precise

 OpenMP is ON by default
   Optimizations controlled by -Othread#
   To shut it off, use -Othread0, -xomp, or -hnoomp

 Autothreading is NOT on by default
   -hautothread to turn it on
   Modernized version of the Cray X1 streaming capability
   Interacts with OMP directives

If you do not want to use OpenMP but have OMP directives in the
code, make sure to do a run with OpenMP shut off at compile time
 Cray has historically played a role in scientific library
  development
    BLAS3 were largely designed for Crays
    Standard libraries were tuned for Cray vector processors
      (later COTS)
    Cray has always tuned standard libraries for the Cray
      interconnect
 In the 90s, Cray provided many non-standard libraries
    Sparse direct, sparse iterative
 These days the goal is to remain portable (standard APIs)
  whilst providing more performance
    Advanced features, tuning knobs, environment variables
FFT:     CRAFFT, FFTW, P-CRAFFT
Dense:   BLAS, LAPACK, ScaLAPACK, IRT, CASE
Sparse:  CASK, PETSc, Trilinos

    IRT – Iterative Refinement Toolkit
    CASK – Cray Adaptive Sparse Kernels
    CRAFFT – Cray Adaptive FFT
    CASE – Cray Adaptive Simple Eigensolver
 There are many libsci libraries on the systems
 One for each of
    Compiler (intel, cray, gnu, pathscale, pgi )
    Single thread, multiple thread
    Target (istanbul, mc12 )
 Best way to use libsci is to ignore all of this
    Load the xtpe-module (some sites set this by default)
    E.g. module load xtpe-shanghai / xtpe-istanbul / xtpe-mc8
 Cray’s drivers will link the library automatically
 PETSc, Trilinos, fftw, acml all have their own module
 Tip: make sure you have the correct library loaded, e.g.
   -Wl,-ydgemm_
 Perhaps you want to link another library such as ACML
 This can be done. If the library is provided by Cray, then load
  the module. The link will be performed with the libraries in the
  correct order.
 If the library is not provided by Cray and has no module, add it
  to the link line.
    Items you add to the explicit link will be in the correct place
 Note, to get explicit BLAS from ACML but scalapack from libsci
    Load acml module. Explicit calls to BLAS in code resolve
      from ACML
    BLAS calls from the scalapack code will be resolved from
      libsci (no way around this)
 Threading capabilities in previous LibSci versions were poor
    Used PTHREADS (more explicit affinity etc.)
    Required explicit linking to a _mp version of LibSci
    Was a source of concern for some applications that need
     hybrid performance and interoperability with OpenMP
 LibSci 10.4.2, February 2010
    OpenMP-aware LibSci
    Allows calling of BLAS inside or outside a parallel region
    Single library supported (there is still a single-thread lib)
 Usage – load the xtpe module for your system (e.g. mc12)

GOTO_NUM_THREADS is outmoded – use OMP_NUM_THREADS
 Allows seamless calling of the BLAS inside or outside a parallel
  region (a fuller example follows below)

e.g. OMP_NUM_THREADS = 12

call dgemm(…)      ! outside a parallel region: threaded dgemm is used with 12 threads
!$OMP PARALLEL DO
do
  call dgemm(…)    ! inside the parallel region: single-thread dgemm is used
end do

Some users are requesting a further layer of parallelism here (see
  later)
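A self-contained sketch of the same pattern (matrix size and loop bounds are illustrative; dgemm is the standard BLAS call resolved from LibSci by the ftn wrapper):

! build: ftn -o dgemm_test dgemm_test.f90   (LibSci linked automatically)
program libsci_omp_dgemm
  implicit none
  integer, parameter :: n = 256
  double precision :: a(n,n), b(n,n), c(n,n), cpriv(n,n)
  integer :: j

  call random_number(a)
  call random_number(b)

  ! Outside a parallel region: LibSci uses the threaded dgemm
  ! (OMP_NUM_THREADS threads cooperate on this one multiply)
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

  ! Inside a parallel region: each thread calls a single-threaded dgemm
  !$omp parallel do private(cpriv)
  do j = 1, 8
     call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, cpriv, n)
  end do
  !$omp end parallel do
end program libsci_omp_dgemm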


[Figure: "Libsci DGEMM efficiency" — GFLOPs vs. matrix dimension (square) for 1, 3, 6, 9, and 12 threads]
[Figure: "Libsci-10.5.2 performance on 2 x MC12 2.0 GHz (Cray XE6)" — GFLOPS vs. number of threads (1 to 24) for K = 64 through K = 800]
 All BLAS libraries are optimized for the rank-k update
 However, a huge % of dgemm usage is not from solvers but from explicit calls
 E.g. DCA++ matrices are of this form
 How can we very easily provide an optimization for these types of
  matrices?
[Figure: matrix shapes for the rank-k update vs. the DCA++-style explicit dgemm calls]
 Cray BLAS existed on every Cray machine between Cray-2 and Cray
  X2
 Cray XT line did not include Cray BLAS
    Cray’s expertise was in vector processors
    GotoBLAS was the best performing x86 BLAS
 LibGoto is now discontinued
 In Q3 2011 LibSci will be released with Cray BLAS




1.   Customers require more OpenMP features unobtainable
     with the current library
2.   Customers require more adaptive performance for
     unusual problems, e.g. DCA++
3.   Interlagos / Bulldozer is a dramatic shift in
     ISA/architecture/performance
4.   Our auto-tuning framework has advanced to the point
     that we can tackle this problem (good BLAS is easy,
     excellent BLAS is very hard)
5.   Need for bit-reproducible BLAS at high performance
"anything that can be represented in C, Fortran or ASM
  code can be generated automatically by one instance
  of an abstract operator in high-level code“

In other words, if we can create a purely general model
  of matrix-multiplication, and create every instance of
  it, then at least one of the generated schemes will
  perform well



                    2011 HPCMP User Group © Cray Inc.   June 20, 2011   79
 Start with a completely general formulation of the BLAS
 Use a DSL that expresses every important optimization
 Auto-generate every combination of orderings, buffering, and
    optimization
   For every combination of the above, sweep all possible sizes
   For a given input set ( M, N, K, datatype, alpha, beta ) map the
    best dgemm routine to the input
   The current library should be a specific instance of the above
   Worst-case performance can be no worse than current library
   The lowest level of blocking is a hand-written assembly kernel



[Figure: GFLOPS of the auto-generated "bframe" kernels vs. libsci across a sweep of small matrix dimensions (y-axis 7.05 to 7.5 GFLOPS)]
 New optimizations for Gemini network in the ScaLAPACK LU and Cholesky
     routines

1.    Change the default broadcast topology to match the Gemini network

2.    Give tools to allow the topology to be changed by the user

3.    Give guidance on how grid-shape can affect the performance




 Parallel version of LAPACK GETRF
 Panel factorization
   Only a single column block is involved
   The rest of the PEs are waiting
 Trailing matrix update
   Major part of the computation
   Column-wise broadcast (blocking)
   Row-wise broadcast (asynchronous)
 Data is packed before sending using PBLAS
 Broadcast uses the BLACS library
 These broadcasts are the major communication
  patterns
 MPI default
    Binomial tree + node-aware broadcast
    All PEs make an implicit barrier to ensure completion
    Not suitable for rank-k update

 Bidirectional-ring broadcast
    Root PE makes 2 MPI send calls, one in each direction
    The immediate neighbor finishes first
    ScaLAPACK's default
    Better than MPI
 Increasing Ring Broadcast (our new default)
    Root makes a single MPI call to the immediate neighbor
    Pipelining
    Better than bidirectional ring
    The immediate neighbor finishes first



 Multi-Ring Broadcast (2, 4, 8 etc)
    The immediate neighbor finishes first
    The root PE sends to multiple sub-rings
        Can be done with tree algorithm

    2 rings seem to be the best for the row-wise broadcast of LU




 Hypercube
    Behaves like MPI default
    Too many collisions in the message traffic
 Decreasing Ring
    The immediate neighbor finishes last
    No benefit in LU
 Modified Increasing Ring
    Best performance in HPL
    As good as increasing ring




[Figure: "XDLU performance: 3072 cores, size=65536" — Gflops for SRING vs. IRING broadcasts across NB / P / Q combinations]
[Figure: "XDLU performance: 6144 cores, size=65536" — Gflops for SRING vs. IRING broadcasts across NB / P / Q combinations]
 A row-major process grid puts adjacent PEs in the same row
    Adjacent PEs are most probably located on the same node
    In flat MPI, 16 or 24 PEs are on the same node
    In hybrid mode, several are on the same node
 Most MPI sends in the increasing ring happen within the same node
    MPI has a good shared-memory device
 Good pipelining
[Figure: adjacent ranks grouped onto Node 0, Node 1, Node 2]
 For PxGETRF:
    SCALAPACK_LU_CBCAST
    SCALAPACK_LU_RBCAST
 For PxPOTRF:
    SCALAPACK_LLT_CBCAST
    SCALAPACK_LLT_RBCAST
    SCALAPACK_UTU_CBCAST
    SCALAPACK_UTU_RBCAST
 The variables let users choose the broadcast algorithm:
    IRING   increasing ring (default value)
    DRING   decreasing ring
    SRING   split ring (old default value)
    MRING   multi-ring
    HYPR    hypercube
    MPI     mpi_bcast
    TREE    tree
    FULL    fully connected
 There is also a set function, allowing the user to change these on the fly
 Grid shape / size
    A square grid is most common
    Try to use Q = x * P grids, where x = 2, 4, 6, 8
    Square grids are often not the best
 Blocksize
    Unlike HPL, fine-tuning is not important
    64 is usually the best
 Ordering
    Try using column-major ordering, it can be better
 BCAST
    The new default will be a huge improvement if you can make your grid
      the right way. If you cannot, play with the environment variables.
      (A grid-setup sketch follows below.)
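A minimal sketch (using the standard BLACS calls shipped with LibSci) of building a rectangular Q = 2*P grid with row-major ordering before calling the ScaLAPACK solvers; the grid-shape arithmetic is illustrative and assumes the process count factors this way:

subroutine make_grid(context, nprow, npcol)
  implicit none
  integer, intent(out) :: context, nprow, npcol
  integer :: iam, nprocs

  call blacs_pinfo(iam, nprocs)
  ! Rectangular grid with Q = 2 * P (assumes nprocs = 2*P*P for some integer P)
  nprow = int(sqrt(real(nprocs) / 2.0))
  npcol = nprocs / nprow

  call blacs_get(-1, 0, context)
  ! 'R' = row-major ordering, so adjacent ranks share a grid row
  ! (and usually a node), which suits the increasing-ring broadcast
  call blacs_gridinit(context, 'R', nprow, npcol)
end subroutine make_grid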



 Full MPI2 support (except process spawning) based on ANL MPICH2
    Cray used the MPICH2 Nemesis layer for Gemini
    Cray-tuned collectives
    Cray-tuned ROMIO for MPI-IO


    Current Release: 5.3.0 (MPICH 1.3.1)
       Improved MPI_Allreduce and MPI_Alltoallv
       Initial support for checkpoint/restart for MPI or Cray SHMEM on XE
        systems
       Improved support for MPI thread safety.
       module load xt-mpich2
 Tuned SHMEM library
    module load xt-shmem


[Figure: "MPI_Alltoall with 10,000 Processes — Comparing Original vs Optimized Algorithms on Cray XE6 Systems": time (microseconds) vs. message size (256 to 32768 bytes) for the original and optimized algorithms]
[Figure: "8-Byte MPI_Allgather and MPI_Allgatherv Scaling — Comparing Original vs Optimized Algorithms on Cray XE6 Systems": time (microseconds) vs. number of processes (1024p to 32768p); the MPI_Allgather and MPI_Allgatherv algorithms are optimized for Cray XE6]
 Default is 8192 bytes
 Maximum size message that can go through the eager protocol
 May help for apps that send medium-size messages and do better
  when loosely coupled. Does the application have a large amount of time in
  MPI_Waitall? Setting this environment variable higher may help.
 Max value is 131072 bytes
 Remember that for this path it helps to pre-post receives if possible (a
  sketch follows below)
 Note that a 40-byte CH3 header is included when accounting for the
  message size.
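A generic sketch (ring neighbors and an 8 KB message chosen for illustration) of pre-posting the receive before the matching send, so an eagerly-sent message can land directly in the user buffer:

program prepost_recv
  use mpi
  implicit none
  integer, parameter :: nwords = 1024        ! 8 KB message, within the eager range
  double precision :: sendbuf(nwords), recvbuf(nwords)
  integer :: req(2), rank, nranks, left, right, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  left  = mod(rank - 1 + nranks, nranks)
  right = mod(rank + 1, nranks)
  sendbuf = dble(rank)

  ! Post the receive first so an eagerly-sent message can be delivered
  ! straight into recvbuf instead of an internal buffer
  call MPI_Irecv(recvbuf, nwords, MPI_DOUBLE_PRECISION, left, 0, &
                 MPI_COMM_WORLD, req(1), ierr)
  call MPI_Isend(sendbuf, nwords, MPI_DOUBLE_PRECISION, right, 0, &
                 MPI_COMM_WORLD, req(2), ierr)
  ! ... overlap useful computation here ...
  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
  call MPI_Finalize(ierr)
end program prepost_recv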




 Default is 64 32K buffers ( 2M total )
 Controls number of 32K DMA buffers available for each rank to use in the
  Eager protocol described earlier
 May help to modestly increase. But other resources constrain the usability
  of a large number of buffers.




 What do I mean by PGAS?
    Partitioned Global Address Space
       UPC
       CoArray Fortran ( Fortran 2008 )
       SHMEM (I will count as PGAS for convenience)
 SHMEM: Library based
    Not part of any language standard
    Compiler independent
        Compiler has no knowledge that it is compiling a PGAS code and
         does nothing different, i.e. no transformations or optimizations
         (a tiny SHMEM sketch of such library-based code follows)
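For illustration, a minimal Fortran SHMEM sketch; everything here is an ordinary library call, which is why the compiler needs no special PGAS knowledge. The ring exchange and variable names are made up for the example.

program shmem_ring
  implicit none
  include 'mpp/shmem.fh'
  integer, save :: dest              ! symmetric (SAVE) so it exists on every PE
  integer :: src
  integer :: my_pe, num_pes          ! SHMEM library functions

  call start_pes(0)
  src = my_pe()
  ! One-sided put: write our PE number into "dest" on the next PE
  call shmem_integer_put(dest, src, 1, mod(my_pe() + 1, num_pes()))
  call shmem_barrier_all()
  print *, 'PE', my_pe(), 'received', dest
end program shmem_ring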




 UPC
    Specification that extends the ISO/IEC 9899 standard for C
    Integrated into the language
    Heavily compiler dependent
         Compiler intimately involved in detecting and executing remote
         references
    Flexible, but filled with challenges like pointers, a lack of true
     multidimensional arrays, and many options for distributing data
 Fortran 2008
    Now incorporates coarrays
    Compiler dependent
    Philosophically different from UPC
        Replication of arrays on every image with “easy and obvious” ways
         to access those remote locations.
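For concreteness, a minimal UPC sketch (illustrative only; the array name and block size B are made up, and a UPC compiler such as CCE's is assumed):

#include <upc.h>
#include <stdio.h>

#define B 4
shared [B] long x[B * THREADS];   /* B consecutive elements have affinity to each thread */

int main(void)
{
    int i;

    /* Each thread initializes the elements it has affinity to. */
    upc_forall (i = 0; i < B * THREADS; i++; &x[i])
        x[i] = MYTHREAD;

    upc_barrier;

    if (MYTHREAD == 0) {
        /* A plain-looking read that is actually a remote load: element
           B*THREADS-1 lives on thread THREADS-1.  The compiler must detect
           this and generate the network transfer. */
        long v = x[B * THREADS - 1];
        printf("last element is owned by thread %ld\n", v);
    }
    return 0;
}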



 Translate the UPC source code into hardware executable operations that
  produce the proper behavior, as defined by the specification
    Storing to a remote location?
    Loading from a remote location?
    When does the transfer need to be complete?
    Are there any dependencies between this transfer and anything else?
    No ordering guarantees are provided by the network; the compiler is
     responsible for making sure everything gets to its destination in the
     correct order.




for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
    local_data[i] += global_2d[i][target];
}


for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
    temp = pgas_get(&global_2d[i][target]); // Initiate the get of the remote element
    pgas_fence();                           // Make sure the get is complete
    local_data[i] += temp;                  // Use the local copy to complete the operation
}
 The compiler must
       Recognize you are referencing a shared location
       Initiate the load of the remote data
       Make sure the transfer has completed
       Proceed with the calculation
       Repeat for all iterations of the loop

for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
    temp = pgas_get(&global_2d[i][target]); // Initiate the get of the remote element
    pgas_fence();                           // Make sure the get is complete
    local_data[i] += temp;                  // Use the local copy to complete the operation
}

 Simple translation results in
    Single word references
    Lots of fences
    Little to no latency hiding
    No use of special hardware
 Nothing here says “fast”

Want the compiler to generate code that will run as fast as possible given what the
user has written, or allow the user to get fast performance with simple modifications.
 Increase message size
    Do multi / many word transfers whenever possible, not single word.
 Minimize fences
    Delay fence “as much as possible”
    Eliminate the fence in some circumstances
 Use the appropriate hardware
    Use on-node hardware for on-node transfers
    Use transfer mechanism appropriate for this message size
    Overlap communication and computation
    Use hardware atomic functions where appropriate




Primary Loop Type            Modifiers
  A - Pattern matched          a - atomic memory operation
  C - Collapsed                b - blocked
  D - Deleted                  c - conditional and/or computed
  E - Cloned                   f - fused
  G - Accelerated              g - partitioned
  I - Inlined                  i - interchanged
  M - Multithreaded            m - partitioned
  V - Vectorized               n - non-blocking remote transfer
                               p - partial
                               r - unrolled
                               s - shortloop
                               w - unwound

15.          shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
…
 83. 1             before = upc_ticks_now();
 84. 1 r8------<     for ( i = 0, j = target; i < ELEMS_PER_THREAD ;
 85. 1 r8                i += 1, j += THREADS ) {
 86. 1 r8 n              local_data[i]= global_1d[j];
 87. 1 r8------>     }
 88. 1             after = upc_ticks_now();

 1D get BW= 0.027598 Gbytes/s




15.      shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
…
101. 1   before = upc_ticks_now();
102. 1   upc_memget(&local_data[0],&global_1d[target],8*ELEMS_PER_THREAD);
103. 1
104. 1   after = upc_ticks_now();




 1D get BW= 0.027598 Gbytes/s
 1D upc_memget BW= 4.972960 Gbytes/s

 upc_memget is roughly 180 times faster!

16.        shared long global_2d[MAX_ELEMS_PER_THREAD][THREADS];
 …
 121. 1 A-------<   for ( i = 0; i < ELEMS_PER_THREAD; i+=1) {
 122. 1 A               local_data[i] = global_2d[i][target];
 123. 1 A------->   }


 1D get BW= 0.027598 Gbytes/s
 1D upc_memget BW= 4.972960 Gbytes/s
 2D get time BW= 4.905653 Gbytes/s
 Pattern matching can give you the same performance as an explicit upc_memget

 PGAS data references made by the single statement immediately following the pgas
  defer_sync directive will not be synchronized until the next fence instruction.
    Only applies to the next UPC/CAF statement
    Does not apply to UPC library routines (e.g., upc_memget)
    Does not apply to SHMEM routines


 Normally the compiler synchronizes the references in a statement as late as
  possible without violating program semantics. The purpose of the defer_sync
  directive is to synchronize the references even later, beyond where the compiler
  can determine it is safe.


 Extremely powerful!
    Can easily overlap communication and computation with this statement
    Can apply to both “gets” and “puts”
    Can be used to implement a variety of “tricks”. Use your imagination!
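A hedged sketch of one way this could be used with CCE UPC; the pragma spelling (#pragma pgas defer_sync), the array name, and the block size are assumptions, not taken from these slides:

#include <upc.h>

#define B 512
shared [B] double x[B * THREADS];   /* B elements have affinity to each thread */
double work[B];                     /* private, thread-local data */

double overlapped_sum(void)
{
    double acc = 0.0, remote_val;
    int i;
    /* Index of the first element owned by the next thread (a remote get). */
    int remote_index = ((MYTHREAD + 1) % THREADS) * B;

    /* Defer synchronization of the PGAS reference in the next statement. */
    #pragma pgas defer_sync
    remote_val = x[remote_index];    /* get is initiated but not yet synchronized */

    for (i = 0; i < B; i++)          /* independent local work overlaps the transfer */
        acc += work[i];

    upc_fence;                       /* the deferred get is complete after the fence */
    return acc + remote_val;
}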




CrayPAT




 Future system basic characteristics:
    Many-core, hybrid multi-core computing


    Increase in on-node concurrency
        10s-100s of cores sharing memory
        With or without a companion accelerator
        Vector hardware at the low level


 Impact on applications:
    Restructure / evolve applications while using existing programming
      models to take advantage of increased concurrency

    Expand on use of mixed-mode programming models (MPI + OpenMP +
      accelerated kernels, etc.)

 Focus on automation (simplify tool usage, provide feedback based on
  analysis)

 Enhance support for multiple programming models within a program (MPI,
  PGAS, OpenMP, SHMEM)

 Scaling (larger jobs, more data, better tool response)


 New processors and interconnects


 Extend performance tools to include pre-runtime optimization information
  from the Cray compiler


 New predefined wrappers (ADIOS, ARMCI, PETSc, PGAS libraries)
 More UPC and Co-array Fortran support
 Support for non-record locking file systems
 Support for applications built with shared libraries
 Support for Chapel programs
 pat_report tables available in Cray Apprentice2




 Enhanced PGAS support is available in perftools 5.1.3 and later
     Profiles of a PGAS program can be created to show:
           Top time consuming functions/line numbers in the code
           Load imbalance information
           Performance statistics attributed to user source by default
           Can expose statistics by library as well
                 To see underlying operations, such as wait time on barriers
     Data collection is based on methods used for MPI library
           PGAS data is collected by default when using Automatic Profiling Analysis
            (pat_build –O apa)
           Predefined wrappers for runtime libraries (caf, upc, pgas) enable attribution of
            samples or time to user source
     UPC and SHMEM heap tracking coming in subsequent release
           -g heap will track shared heap in addition to local heap



Table 1:    Profile by Function

      Samp % | Samp | Imb. |   Imb. |Group
             |      | Samp | Samp % | Function
             |      |      |        | PE='HIDE'

     100.0% |   48 |   -- |     -- |Total
    |------------------------------------------
    | 95.8% |    46 |   -- |     -- |USER
    ||-----------------------------------------
    || 83.3% |    40 | 1.00 |   3.3% |all2all
    ||   6.2% |    3 | 0.50 | 22.2% |do_cksum
    ||   2.1% |    1 | 1.00 | 66.7% |do_all2all
    ||   2.1% |    1 | 0.50 | 66.7% |mpp_accum_long
    ||   2.1% |    1 | 0.50 | 66.7% |mpp_alloc
    ||=========================================
    |   4.2% |    2 |   -- |     -- |ETC
    ||-----------------------------------------
    ||   4.2% |    2 | 0.50 | 33.3% |bzero
    |==========================================




Table 2:        Profile by Group, Function, and Line


      Samp % | Samp | Imb. |         Imb. |Group
                |        | Samp | Samp % | Function
                |        |      |        |   Source
                |        |      |        |    Line
                |        |      |        |     PE='HIDE'


      100.0% |        48 |    -- |    -- |Total
    |--------------------------------------------
    |    95.8% |       46 |   -- |     -- |USER
    ||-------------------------------------------
    || 83.3% |    40 |   -- |     -- |all2all
    3|        |      |      |        | mpp_bench.c
    4|        |      |      |        | line.298
    ||   6.2% |    3 |   -- |     -- |do_cksum
    3|        |      |      |        | mpp_bench.c
    ||||-----------------------------------------
    4|||   2.1% |    1 | 0.25 | 33.3% |line.315
    4|||   4.2% |    2 | 0.25 | 16.7% |line.316
    ||||=========================================




Table 1:        Profile by Function and Callers, with Line Numbers
        Samp % | Samp |Group
                |          | Function
                |          |       Caller
                |          |        PE='HIDE’
        100.0% |        47 |Total
    |---------------------------
    |     93.6% |        44 |ETC
    ||--------------------------
    ||     85.1% |        40 |upc_memput
    3|              |          | all2all:mpp_bench.c:line.298
    4|              |          |    do_all2all:mpp_bench.c:line.348
    5|              |          |     main:test_all2all.c:line.70
    ||      4.3% |         2 |bzero
    3|              |          | (N/A):(N/A):line.0
    ||      2.1% |         1 |upc_all_alloc
    3|              |          | mpp_alloc:mpp_bench.c:line.143
    4|              |          |    main:test_all2all.c:line.25
    ||      2.1% |         1 |upc_all_reduceUL
    3|              |          | mpp_accum_long:mpp_bench.c:line.185
    4|              |          |    do_cksum:mpp_bench.c:line.317
    5|              |          |     do_all2all:mpp_bench.c:line.341
    6|              |          |      main:test_all2all.c:line.70
    ||==========================



Table 1:        Profile by Function and Callers, with Line Numbers


      Time % |          Time |       Calls |Group
                |            |             | Function
                |            |             |       Caller
                |            |             |       PE='HIDE'


      100.0% | 0.795844 | 73904.0 |Total
    |-----------------------------------------
    |    78.9% | 0.628058 | 41121.8 |PGAS
    ||----------------------------------------
    ||    76.1% | 0.605945 | 32768.0 |__pgas_put
    3|              |            |             | all2all:mpp_bench.c:line.298
    4|              |            |             |    do_all2all:mpp_bench.c:line.348
    5|              |            |             |     main:test_all2all.c:line.70
    ||      1.5% | 0.012113 |           10.0 |__pgas_barrier
    3|              |            |             | (N/A):(N/A):line.0
    …



…
    ||========================================
    |    15.7% | 0.125006 |     3.0 |USER
    ||----------------------------------------
    ||    12.2% | 0.097125 |     1.0 |do_all2all
    3|           |          |        | main:test_all2all.c:line.70
    ||      3.5% | 0.027668 |    1.0 |main
    3|           |          |        | (N/A):(N/A):line.0
    ||========================================
    |     5.4% | 0.042777 | 32777.2 |UPC
    ||----------------------------------------
    ||      5.3% | 0.042321 | 32768.0 |upc_memput
    3|           |          |        | all2all:mpp_bench.c:line.298
    4|           |          |        |   do_all2all:mpp_bench.c:line.348
    5|           |          |        |      main:test_all2all.c:line.70
    |=========================================




[Screenshot: Cray Apprentice2 showing the new text-table icon; right-click for table-generation options]
 Scalability
    New .ap2 data format and client / server model
        Reduced pat_report processing and report generation times
        Reduced app2 data load times
        Graphical presentation handled locally (not passed through ssh
           connection)
          Better tool responsiveness
          Minimizes data loaded into memory at any given time
          Reduced server footprint on Cray XT/XE service node
          Larger jobs supported

    Distributed Cray Apprentice2 (app2) client for Linux
        app2 client for Mac and Windows laptops coming later this year


 CPMD
   MPI, instrumented with pat_build –u, HWPC=1
   960 cores
                   Perftools 5.1.3            Perftools 5.2.0
  .xf -> .ap2        88.5 seconds             22.9 seconds
  ap2 -> report    1512.27 seconds            49.6 seconds

 VASP
   MPI, instrumented with pat_build –gmpi –u, HWPC=3
   768 cores

                   Perftools 5.1.3            Perftools 5.2.0
  .xf -> .ap2      45.2 seconds               15.9 seconds
  ap2 -> report    796.9 seconds              28.0 seconds

 From a Linux desktop:

   % module load perftools

   % app2
   % app2 kaibab:
   % app2 kaibab:/lus/scratch/heidi/swim+pat+10302-0t.ap2

   (a ':' signifies a remote host rather than a local .ap2 file)

 File->Open Remote… can also be used from within Cray Apprentice2
 Optional app2 client for Linux desktop available as of 5.2.0


 Can still run app2 from Cray service node


 Improves response times as X11 traffic is no longer passed through the ssh
  connection

 Replaces 32-bit Linux desktop version of Cray Apprentice2


 Uses libssh to establish connection


 app2 clients for Windows and Mac coming in subsequent release


[Diagram: traditional model — Cray Apprentice2 (app2) runs on a Cray XT login node; all data from my_program.ap2 plus X11 protocol traffic travels over the ssh connection to the X Window System application on the Linux desktop; performance data is collected on the compute nodes by my_program+apa.]

 Log into a Cray XT/XE login node
    % ssh –Y seal

 Launch Cray Apprentice2 on the Cray XT/XE login node
    % app2 /lus/scratch/mydir/my_program.ap2
    User interface displayed on desktop via ssh trusted X11 forwarding
    Entire my_program.ap2 file loaded into memory on the XT login node (can
     be gigabytes of data)
[Diagram: client/server model — the app2 client and X Window System application run on the Linux desktop; the app2 server on the Cray XT login node reads my_program.ap2 and sends only the user-requested data over the connection; performance data is collected on the compute nodes by my_program+apa.]

 Launch Cray Apprentice2 on the desktop, pointing it at the data
    % app2 seal:/lus/scratch/mydir/my_program.ap2
    User interface displayed on desktop via X Windows-based software
    Minimal subset of data from my_program.ap2 loaded into memory on the
     Cray XT/XE service node at any given time
    Only the data requested is sent from server to client
 Major change to the way HW counters are collected starting with CrayPat
  5.2.1 and CLE 4.0 (in conjunction with Interlagos support)

 Linux has officially incorporated support for accessing counters through the
  perf_events subsystem. Until now, Linux kernels had to be patched to add
  support for perfmon2, which provided access to the counters for PAPI
  and for CrayPat.

 Seamless to users, except:
    Overhead incurred when accessing counters has increased
    Creates additional application perturbation
    Working to bring this back in line with perfmon2 overhead




 When possible, CrayPat will identify dominant communication grids
  (communication patterns) in a program
    Example: nearest neighbor exchange in 2 or 3 dimensions
       Sweep3d uses a 2-D grid for communication


 Determine whether or not a custom MPI rank order will produce a
  significant performance benefit

 Custom rank orders are helpful for programs with significant point-to-point
  communication

 Doesn’t interfere with MPI collective communication optimizations



 Focuses on intra-node communication (places ranks that communicate
  frequently on the same node, or close by)
    Option to focus on other metrics such as memory bandwidth


 Determine rank order used during run that produced data
 Determine grid that defines the communication


 Produce a custom rank order if it’s beneficial based on grid size, grid order
  and cost metric

 Summarize findings in report
 Describe how to re-run with custom rank order



For Sweep3d with 768 MPI ranks:

This application uses point-to-point MPI communication between nearest
  neighbors in a 32 X 24 grid pattern. Time spent in this communication
  accounted for over 50% of the execution time. A significant fraction (but
  not more than 60%) of this time could potentially be saved by using the
  rank order in the file MPICH_RANK_ORDER.g which was generated along
  with this report.

To re-run with a custom rank order …




 Assist the user with application performance analysis and optimization
           Help user identify important and meaningful information from
            potentially massive data sets
           Help user identify problem areas instead of just reporting data
           Bring optimization knowledge to a wider set of users



     Focus on ease of use and intuitive user interfaces
        Automatic program instrumentation
        Automatic analysis


     Target scalability issues in all areas of tool development
        Data management
            Storage, movement, presentation



 Supports traditional post-mortem performance analysis
       Automatic identification of performance problems
                 Indication of causes of problems
                 Suggestions of modifications for performance improvement


 CrayPat
       pat_build: automatic instrumentation (no source code changes needed)
       run-time library for measurements (transparent to the user)
       pat_report for performance analysis reports
       pat_help: online help utility


 Cray Apprentice2
       Graphical performance analysis and visualization tool




 CrayPat
           Instrumentation of optimized code
           No source code modification required
           Data collection transparent to the user
           Text-based performance reports
           Derived metrics
           Performance analysis


     Cray Apprentice2
           Performance data visualization tool
           Call tree view
           Source code mappings




 When performance measurement is triggered
         External agent (asynchronous)
                 Sampling
                   Timer interrupt
                   Hardware counters overflow
         Internal agent (synchronous)
                 Code instrumentation
                   Event based
                   Automatic or manual instrumentation
   How performance data is recorded
         Profile ::= Summation of events over time
                 run time summarization (functions, call sites, loops, …)
         Trace file ::= Sequence of events over time




 Millions of lines of code
           Automatic profiling analysis
                 Identifies top time consuming routines
                 Automatically creates instrumentation template customized to your
              application
     Lots of processes/threads
        Load imbalance analysis
            Identifies computational code regions and synchronization calls that
              could benefit most from load balance optimization
            Estimates savings if corresponding section of code were balanced
     Long running applications
        Detection of outliers




 Important performance statistics:


       Top time consuming routines


       Load balance across computing resources


       Communication overhead


       Cache utilization


       FLOPS


       Vectorization (SSE instructions)


       Ratio of computation versus communication

 No source code or makefile modification required
           Automatic instrumentation at group (function) level
                 Groups: mpi, io, heap, math SW, …


     Performs link-time instrumentation
           Requires object files
           Instruments optimized code
           Generates stand-alone instrumented program
           Preserves original binary
           Supports sample-based and event-based instrumentation




       Analyze the performance data and direct the user to meaningful
            information

           Simplifies the procedure to instrument and collect performance data for
            novice users

          Based on a two phase mechanism
           1. Automatically detects the most time consuming functions in the
              application and feeds this information back to the tool for further
              (and focused) data collection

           2.   Provides performance information on the most significant parts of the
                application


 Performs data conversion


             Combines information from binary with raw performance
                data

       Performs analysis on data


     Generates text report of performance results


     Formats data for input into Cray Apprentice2

 Craypat / Cray Apprentice2 5.0 released September 10, 2009


           New internal data format
           FAQ
           Grid placement support
           Better caller information (ETC group in pat_report)
           Support larger numbers of processors
           Client/server version of Cray Apprentice2
           Panel help in Cray Apprentice2




       Access performance tools software

                % module load perftools

           Build application keeping .o files (CCE: -h keepfiles)

                % make clean
                % make

          Instrument application for automatic profiling analysis
              You should get an instrumented program a.out+pat

                % pat_build –O apa a.out

          Run application to get top time consuming routines
             You should get a performance file (“<sdatafile>.xf”) or
              multiple files in a directory <sdatadir>

                % aprun … a.out+pat               (or        qsub <pat script>)

      Generate report and .apa instrumentation file

       % pat_report –o my_sampling_report [<sdatafile>.xf |
          <sdatadir>]

      Inspect .apa file and sampling report

      Verify if additional instrumentation is needed




# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
#        pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
# These suggested trace options are based on data from:
#
#     /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2,
#     /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf

# ----------------------------------------------------------------------

#     HWPC group to collect by default.

 -Drtenv=PAT_RT_HWPC=1 # Summary with instructions metrics.

# ----------------------------------------------------------------------

#     Libraries to trace.

 -g mpi

# ----------------------------------------------------------------------

#     User-defined functions to trace, sorted by % of samples.
#     Limited to top 200. A function is commented out if it has < 1%
#     of samples, or if a cumulative threshold of 90% has been reached,
#     or if it has size < 200 bytes.

# Note: -u should NOT be specified as an additional option.

# 43.37% 99659 bytes
      -T mlwxyz_

# 16.09% 17615 bytes
      -T half_

#  6.82% 6846 bytes
      -T artv_

#  1.29% 5352 bytes
      -T currenh_

#  1.03% 25294 bytes
      -T bndbo_

# Functions below this point account for less than 10% of samples.

#  1.03% 31240 bytes
#     -T bndto_

...

# ----------------------------------------------------------------------

 -o mhd3d.x+apa                    # New instrumented program.

 /work/crayadm/ldr/mhd3d/mhd3d.x   # Original program.
   biolib    Cray Bioinformatics library routines
   blacs     Basic Linear Algebra communication subprograms
   blas      Basic Linear Algebra subprograms
   caf       Co-Array Fortran (Cray X2 systems only)
   fftw      Fast Fourier Transform library (64-bit only)
   hdf5      manages extremely large and complex data collections
   heap      dynamic heap
   io        includes stdio and sysio groups
   lapack    Linear Algebra Package
   lustre    Lustre File System
   math      ANSI math
   mpi       MPI
   netcdf    network common data form (manages array-oriented scientific data)
   omp       OpenMP API (not supported on Catamount)
   omp-rtl   OpenMP runtime library (not supported on Catamount)
   portals   Lightweight message passing API
   pthreads  POSIX threads (not supported on Catamount)
   scalapack Scalable LAPACK
   shmem     SHMEM
   stdio     all library functions that accept or return the FILE* construct
   sysio     I/O system calls
   system    system calls
   upc       Unified Parallel C (Cray X2 systems only)
  0  Summary with instruction metrics
  1  Summary with TLB metrics
  2  L1 and L2 metrics
  3  Bandwidth information
  4  Hypertransport information
  5  Floating point mix
  6  Cycles stalled, resources idle
  7  Cycles stalled, resources full
  8  Instructions and branches
  9  Instruction cache
 10  Cache hierarchy
 11  Floating point operations mix (2)
 12  Floating point operations mix (vectorization)
 13  Floating point operations mix (SP)
 14  Floating point operations mix (DP)
 15  L3 (socket-level)
 16  L3 (core-level reads)
 17  L3 (core-level misses)
 18  L3 (core-level fills caused by L2 evictions)
 19  Prefetches
 Regions, useful to break up long routines
    int PAT_region_begin (int id, const char *label)
    int PAT_region_end (int id)
 Disable/Enable Profiling, useful for excluding initialization
    int PAT_record (int state)
 Flush buffer, useful when program isn’t exiting cleanly
    int PAT_flush_buffer (void)
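A hedged C sketch of how this API might be used, following the signatures shown above; the pat_api.h header name and the PAT_STATE_ON / PAT_STATE_OFF constants are assumptions, and initialize() / compute() are hypothetical stand-ins for real application phases:

#include <pat_api.h>   /* CrayPat API header (available with perftools loaded) */

/* Hypothetical application phases, stand-ins for real code. */
static void initialize(void) { /* ... setup ... */ }
static void compute(void)    { /* ... main work ... */ }

int main(void)
{
    PAT_record(PAT_STATE_OFF);          /* exclude initialization from the experiment */
    initialize();
    PAT_record(PAT_STATE_ON);

    PAT_region_begin(1, "compute");     /* region id 1, labeled in reports */
    compute();
    PAT_region_end(1);

    PAT_flush_buffer();                 /* useful if the program may not exit cleanly */
    return 0;
}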




      Instrument application for further analysis (a.out+apa)

       % pat_build –O <apafile>.apa

      Run application

       % aprun … a.out+apa         (or    qsub <apa script>)

      Generate text report and visualization file (.ap2)

       % pat_report –o my_text_report.txt [<datafile>.xf |
          <datadir>]


      View report in text and/or with Cray Apprentice2

       % app2 <datafile>.ap2



 MUST run on Lustre ( /work/… , /lus/…, /scratch/…, etc.)


     Number of files used to store raw data


           1 file created for program with 1 – 256 processes

           √n files created for a program with n processes, when n > 256

           Ability to customize with PAT_RT_EXPFILE_MAX
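           For example, a run with 16,384 processes produces √16384 = 128 raw data files by default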




 Full trace files show transient events but are too large

 Current run-time summarization misses transient events

 Plan to add ability to record:

     Top N peak values (N small)

     Approximate std dev over time

     For time, memory traffic, etc.

     During tracing and sampling




 Call graph profile
 Communication statistics
 Time-line view
    Communication
    I/O
 Activity view
 Pair-wise communication statistics
 Text reports
 Source code mapping

 Cray Apprentice2 is targeted to help identify and correct:
    Load imbalance
    Excessive communication
    Network contention
    Excessive serialization
    I/O problems
Switch Overview display




[Screenshot callouts: Min, Avg, and Max values; −1/+1 standard-deviation marks]
[Cray Apprentice2 overview callouts]
   Width  → inclusive time
   Height → exclusive time
   Load balance overview:
      Height     → max time
      Middle bar → average time
      Lower bar  → min time
      Yellow represents imbalance time
   DUH Button: provides hints for performance tuning
   Filtered nodes or sub-trees are marked
   Function List pane; Zoom control
[Call tree view callouts]
   Right mouse click on a node: node menu (e.g., hide/unhide children)
   Right mouse click in the view: view menu (e.g., Filter)
   Sort options: % Time, Time, Imbalance %, Imbalance time
   Function List can be toggled off
[Screenshot callouts: Min, Avg, and Max values; −1/+1 standard-deviation marks]
 Cray Apprentice2 panel help


     pat_help – interactive help on the Cray Performance toolset


     FAQ available through pat_help




 intro_craypat(1)
           Introduces the craypat performance tool
     pat_build
           Instrument a program for performance analysis
     pat_help
           Interactive online help utility
     pat_report
           Generate performance report in both text and for use with GUI
     hwpc(3)
           describes predefined hardware performance counter groups
     papi_counters(5)
           Lists PAPI event counters
           Use papi_avail or papi_native_avail utilities to get list of events when
                running on a specific architecture
pat_report: Help for -O option:

Available option values are in left column, a prefix can be specified:

  ct                  -O calltree
  defaults            Tables that would appear by default.
  heap                -O heap_program,heap_hiwater,heap_leaks
  io                  -O read_stats,write_stats
  lb                  -O load_balance
  load_balance        -O lb_program,lb_group,lb_function
  mpi                 -O mpi_callers
  ---
  callers             Profile by Function and Callers
  callers+hwpc        Profile by Function and Callers
  callers+src         Profile by Function and Callers,                with Line Numbers
  callers+src+hwpc    Profile by Function and Callers,                with Line Numbers
  calltree            Function Calltree View
  calltree+hwpc       Function Calltree View
  calltree+src        Calltree View with Callsite Line                Numbers
  calltree+src+hwpc   Calltree View with Callsite Line                Numbers
  ...


 Interactive by default, or use trailing '.' to just print a topic:


     New FAQ craypat 5.0.0.


     Has counter and counter group information


          % pat_help counters amd_fam10h groups .




The top level CrayPat/X help topics are listed below.
       A good place to start is:
                overview
       If a topic has subtopics, they are displayed under the heading
       "Additional topics", as below. To view a subtopic, you need
       only enter as many initial letters as required to distinguish
       it from other items in the list. To see a table of contents
       including subtopics of those subtopics, etc., enter:
                toc
       To produce the full text corresponding to the table of contents,
       specify "all", but preferably in a non-interactive invocation:
                pat_help all . > all_pat_help
                pat_help report all . > all_report_help
   Additional topics:
       API                         execute
       balance                     experiment
       build                       first_example
       counters                    overview
       demos                       report
       environment                 run
pat_help (.=quit ,=back ^=up /=top ~=search)
=>

 ATP (Abnormal Termination Processing), or: what do you do when task a
  causes task b to crash?
    Load the atp module before compiling
    Set ATP_ENABLED before running
 Limitations
    ATP disables core dumping. When ATP is running, an application crash
     does not produce a core dump.
    When ATP is running, the application cannot be checkpointed.
    ATP does not support threaded application processes.
    ATP has been tested at 10,000 cores. Behavior at core counts greater
     than 10,000 is still being researched.




Application 926912 is crashing. ATP analysis proceeding...

       Stack walkback for Rank 3 starting:
        _start@start.S:113
        __libc_start_main@libc-start.c:220
        main@testMPIApp.c:83
        foo@testMPIApp.c:47
        raise@pt-raise.c:42
       Stack walkback for Rank 3 done
       Process died with signal 4: 'Illegal instruction'
       View application merged backtrace tree file
'atpMergedBT.dot' with 'statview'
       You may need to 'module load stat'.




 What CCM is NOT
   It is NOT a virtual machine or any OS within an OS
   It is NOT an emulator




 What is CCM Then?
    Provides the runtime environment on compute nodes expected by ISV
     applications
    Dynamically allocates and configures compute nodes at job start
       Nodes are not permanently dedicated to CCM
       Any compute node can be used
       Allocated like any other batch job (on demand)

    MPI and third-party MPI runs over TCP/IP using high-speed network
    Supports standard services: ssh, rsh, nscd, ldap
    Complete root file system on the compute nodes
       Built on top of the Dynamic Shared Libraries (DSL) environment

    Apps run under CCM: Abaqus, MATLAB, CASTEP, Discover, DMol3,
     Mesodyn, EnSight and more

   Under CCM, everything the application can “see” is like a standard Linux
                cluster: Linux OS, x86 processor, and MPI
Cray XT6/XE6 System

[Diagram: compute nodes and service nodes, with compute nodes marked as ESM Mode Running,
 CCM Mode Running, or ESM Mode Idle]

• Many applications running in Extreme Scalability Mode (ESM)
• Submit CCM application through batch scheduler, nodes reserved
      qsub –l ccm=1 Qname AppScript
• Previous jobs finish, nodes configured for CCM
• Executes the batch script and application
• Other nodes scheduled for ESM or CCM applications as available
• After CCM job completes, CCM nodes cleared
• CCM nodes available for ESM or CCM applications
 Support MPIs that are configured to work with the OFED stack
 CCM1 supports ISV Applications over TCP/IP only
 CCM2 supports ISV Applications over TCP/IP and Gemini on XE6


 ISV Application Acceleration (IAA) directly utilizes HSN through the
  Gemini user-space APIs.

 Goal of IAA/CCM2 is to deliver latency and bandwidth improvement
  over CCM1 over TCP/IP.

 CCM2 infrastructure is currently in system test.
 IAA design and implementation phase is complete
 CCM2 with IAA is currently in integration test phase
 A code binary compiled for SLES and an Opteron
    DSOs are OK
 A third-party MPI library that can use TCP/IP
    We have tried OpenMPI, HP-MPI, LAM-MPI
    Most of the bigger apps are packaged with their own library (usually
     HP-MPI)
    Add ccmrun to the run script
 The IP address of the license server for the applications
    Note that right now CCM cannot do an NSLOOKUP
    LMHOSTS must be specified by IP address
 With CLE 4.0: an MPI library that uses IBVERBS




 ccmrun: analogous to aprun; runs a third-party batch job
    In most cases, if you already have a run script for your third-party app,
     adding ccmrun before the application command is all that is needed
 ccmlogin: allows interactive access to the head node of an allocated
  compute pool; takes optional ssh options
 CCM uses ssh known_hosts to set up passwordless ssh among the set of
  allocated compute nodes. You can ssh to allocated nodes but no further.




[Diagram: XE6 system layout — external login servers, internal login (PBS) nodes, and
 compute nodes; boot RAID storage; IB QDR and 10 GbE connections to the external login
 server.]
 External login nodes: Dell 4-socket servers through which the user enters
  the system

 PBS nodes: internal single-socket, 6-core nodes that run the PBS MOMs
    aprun must be issued from a node on the system database

 Compute nodes: 2-socket, 8-core Opteron nodes that run a trimmed-down
  OS (still Linux)




news: diskuse_work diskuse_home system_info.txt
aminga@garnet01:~> uname -a
Linux garnet01 2.6.27.48-0.12-default #1 SMP 2010-09-20 11:03:26 -0400
x86_64 x86_64 x86_64 GNU/Linux

aminga@garnet01:~> qsub -I -lccm=1 -q debug -l walltime=01:00:00 -l ncpus=32 -A ERDCS97290STA

qsub: waiting for job 104868.sdb to start
qsub: job 104868.sdb ready
In CCM JOB: 104868.sdb JID sdb USER aminga GROUP erdcssta
Initializing CCM environment, Please Wait
CCM Start success, 2 of 2 responses
aminga@garnet13:~> uname -a
Linux garnet13 2.6.27.48-0.12.1_1.0301.5737-cray_gem_s #1 SMP Mon Mar 28
22:20:59 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux



aminga@garnet13:~> cat $PBS_NODEFILE
nid00972
nid00972
nid00972
<snip>
nid01309
nid01309
nid01309

aminga@garnet13:~> ccmlogin
Last login: Mon Jun 13 13:03:26 2011 from nid01028

aminga@nid00972:~> uname -a
Linux nid00972 2.6.27.48-0.12.1_1.0301.5737-cray_gem_c #1 SMP Mon Mar 28 22:26:26 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

aminga@nid00972:~> ssh nid01309
Try `uname --help' for more information.
aminga@nid01309:~> uname -a
Linux nid01309 2.6.27.48-0.12.1_1.0301.5737-cray_gem_c #1 SMP Mon Mar 28 22:26:26 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
aminga@nid01309:~>

aminga@nid00972:~> ssh nid01310
Redirecting to /etc/ssh/ssh_config
ssh: connect to host nid01310 port ...: Connection refused
#!/bin/csh
#PBS -l mppwidth=2
#PBS -l mppnppn=1
#PBS -q ccm_queue
#PBS -j oe

cd $PBS_O_WORKDIR

perl ConstructMachines.LINUX.pl
setenv DSD_MachineLIST $PBS_O_WORKDIR/machines.LINUX
setenv MPI_COMMAND "/usr/local/applic/accelrys/MSModeling5.5/hpmpi/opt/hpmpi/bin/mpirun -np "

ccmrun ./RunDiscover.sh -np 2 nvt_m




#PBS -l mppwidth=2
#PBS -l mppnppn=1
#PBS -j oe
#PBS -N gauss-test-ccm
#PBS -q ccm_queue

cd $PBS_O_WORKDIR
cp $PBS_NODEFILE node_file
./CreatDefaultRoute.pl
mkdir -p scratch
setenv DVS_CACHE off
setenv g09root /usr/local/applic/gaussian/
setenv GAUSS_EXEDIR ${g09root}/g09
setenv GAUSS_EXEDIR ${g09root}/g09/linda-exe:$GAUSS_EXEDIR
setenv GAUSS_SCRDIR `pwd`
setenv TMPDIR `pwd`
source ${g09root}/g09/bsd/g09.login

setenv GAUSS_LFLAGS "-vv -nodefile node_file -opt Tsnet.Node.lindarsharg:ssh"

setenv LINDA_PATH ${g09root}/g09/linda8.2/opteron-linux
set LINDA_LAUNCHVERBOSE=1

ccmrun ${g09root}/g09/g09 < gauss-test-ccm.com
setenv TEND `echo "print time();" | perl`
echo "Gaussian CCM walltime: `expr $TEND - $TBEGIN` seconds"
cd $PBS_O_WORKDIR

/bin/rm -rf bhost.def
cat $PBS_NODEFILE > bhost.def

/bin/rm -rf job.script
cat > job.script << EOD
#!/bin/csh
set echo
cd $PWD
setenv AEROSOFT_HOME /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft
setenv LAMHOME /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft
setenv PATH /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft/bin:$PATH

setenv TMPDIR /work/aminga
ln -s /usr/lib64/libpng.so libpng.so.3
setenv LD_LIBRARY_PATH `pwd`:$LD_LIBRARY_PATH

setenv LAMRSH "ssh -x"
lamboot bhost.def

time mpirun -np 2 -x LD_LIBRARY_PATH gasp --mpi -i duct.xml --run 2 --elmhost 140.31.9.44

EOD



chmod +x job.script
ccmrun job.script
#!/bin/sh
#PBS -q ccm_queue
#PBS -lmppwidth=48
#PBS -j oe
#PBS -N CFX

cd $PBS_O_WORKDIR

TOP_DIR=/usr/local/applic/ansys
export ANSYSLIC_DIR=$TOP_DIR/shared_files/licensing
export LD_LIBRARY_PATH=$TOP_DIR/v121/CFX/tools/hpmpi-2.3/Linux-amd64/lib/linux_amd64:$LD_LIBRARY_PATH
export PATH=$TOP_DIR/v121/CFX/bin:$PATH

export CFX5RSH=ssh
export MPIRUN_OPTIONS="-TCP -prot -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23"

/bin/rm -rf host.list
cat $PBS_NODEFILE > host.list

export proc_list=`sort host.list | uniq -c | awk '{ printf("%s*%s ", $2, $1) ; }'`
echo $proc_list

which cfx5solve
ccmrun cfx5solve -def S*400k.def -par-dist "$proc_list" -start-method "HP MPI Distributed Parallel"
rm -f host.list


#!/bin/bash
#PBS -lmppwidth=16
#PBS -q ccm_queue
#PBS -j oe
#PBS -N abaqus_e1

cd $PBS_O_WORKDIR

TMPDIR=.

ABAQUS=/usr/local/applic/abaqus
#cp ${ABAQUS}/input/e1.inp e1.inp
cat $PBS_NODEFILE
echo "Run Abaqus"
ccmrun ${ABAQUS}/6.10-1/exec/abq6101.exe input=e1.inp job=e1 cpus=16 interactive




#!/bin/csh
#PBS -q ccm_queue
#PBS -l mppwidth=32
#PBS -j oe
#PBS -N AFRL_Fluent
cd $PBS_O_WORKDIR

setenv FLUENT_HOME /usr/local/applic/fluent/12.1/fluent

setenv FLUENT_ARCH lnamd64
setenv PATH /usr/local/applic/fluent/12.1/v121/fluent/bin:$PATH
setenv FLUENT_INC /usr/local/applic/fluent/12.1/v121/fluent
###setenv LM_LICENSE_FILE 7241@10.128.0.72
setenv LM_LICENSE_FILE 27000@10.128.0.76
setenv ANSYSLMD_LICENSE_FILE /home/applic/ansys/shared_files/licensing/license.dat
echo ${LM_LICENSE_FILE}

setenv FLUENT_VERSION -r12.1.1

cd $PBS_O_WORKDIR

rm -rf host.list
cat $PBS_NODEFILE > host.list
module load ccm dot

setenv MALLOC_MMAP_MAX_ 0
setenv MALLOC_TRIM_THRESHOLD_ 536870912
setenv MPIRUN_OPTIONS " -TCP -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7"
setenv MPIRUN_OPTIONS "${MPIRUN_OPTIONS},8,9,10,11,12,13,14,15 "
setenv MPI_SOCKBUFSIZE 524288
setenv MPI_WORKDIR $PWD
setenv MPI_COMMD 1024,1024

ccmrun /usr/local/applic/fluent/v121/fluent/bin/fluent -r12.1.2 2ddp -mpi=hp -gu -driver null -t4 -i blast.inp > tstfluent-blast.jobout
 ALPS allows you to run only one aprun instance per node. Using CCM you can
  get around that.
 So suppose you want to run 16 single-core jobs and use only one node:
    qsub -I -lccm=1 -q debug -l walltime=01:00:00 -l ncpus=16 -A
      ERDCS97290STA
    #PBS -j oe
    cd $PBS_O_WORKDIR
    ./myapp &
    ./myapp &
    ./myapp &
    ./myapp &
    ./myapp &
    ./myapp &
    wait    # wait for the backgrounded instances to finish

Engineering for Multi-level Parallelism




 Flat, all-MPI parallelism is beginning to be too limited as the number of
  compute cores rapidly increases
 It is becoming necessary to design applications with multiple levels of
  parallelism:

 High-level MPI parallelism between nodes
    You’re probably already doing this
 Loose, on-node parallelism via threads at a high level
    Most codes today are using MPI, but threading is becoming more
      important
 Tight, on-node, vector parallelism at a low level
    SSE/AVX on CPUs
    GPU threaded parallelism


Programmers need to expose the same parallelism for all future architectures
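A hedged C sketch of the three levels together (illustrative only; the array sizes and names are made up):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 4096

static double a[N], b[N], c[N];

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);                  /* level 1: MPI between nodes      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel for                 /* level 2: OpenMP threads on node */
    for (int j = 0; j < N; j++) {
        a[j] = rank + j;
        b[j] = rank - j;
        c[j] = a[j] + 2.0 * b[j];            /* level 3: stride-1 arithmetic the
                                                compiler can vectorize (SSE/AVX) */
    }

    printf("rank %d: c[0] = %f\n", rank, c[0]);
    MPI_Finalize();
    return 0;
}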

 A benchmark problem was defined to closely resemble the target simulation
    52-species n-heptane chemistry and 48³ grid points per node
    48³ grid points × 18,500 nodes ≈ 2 billion grid points
    Target problem would take two months on today’s Jaguar
 Code was benchmarked and profiled on dual-hexcore XT5
 Several kernels identified and extracted into stand-alone driver programs
    Mini-Apps!

[Pie chart: share of runtime spent in Chemistry vs. Core S3D]
Goals:
   Convert S3D to a hybrid multi-core application suited for a multi-core node with
    or without an accelerator.
       Hoisted several loops up the call tree
       Introduced high-level OpenMP
   Be able to perform the computation entirely on the accelerator if available.
    - Arrays and data able to reside entirely on the accelerator.
    - Data sent from accelerator to host CPU for halo communication, I/O and
       monitoring only.
  Strategy:
   To program using both hand-written and generated code.
    - Hand-written and tuned CUDA*.
    - Automated Fortran and CUDA generation for chemistry kernels
    - Automated code generation through compiler directives
   S3D kernels are now a part of Cray’s compiler development test cases
* Note: CUDA refers to CUDA-Fortran, unless mentioned otherwise
RHS – called 6 times for each time step (Runge-Kutta iterations):

    Calculate primary variables – point-wise; mesh loops within 5 different
     routines

    Perform derivative computation – high-order differencing

    Calculate diffusion – 3 different routines with some derivative
     computation

    Perform derivative computation for forming rhs – lots of communication

    Perform point-wise chemistry computation

Notes:
    All major loops are at a low level of the call tree
    Green – major computation, point-wise
    Yellow – major computation, halos 5 zones thick
RHS – called 6 times for each time step (Runge-Kutta iterations), after restructuring:
 OMP loop over grid: calculate primary variables – point-wise mesh loops within 3 different routines
 Perform derivative computation – high-order differencing (overlapped)
 OMP loop over grid: calculate primary variables – point-wise mesh loops within 2 different routines;
   calculate diffusion – 3 different routines with some derivative computation;
   perform derivative computation (overlapped)
 OMP loop over grid: perform point-wise chemistry computation (1)
 Perform derivative computation for forming rhs – lots of communication (overlapped)
 OMP loop over grid: perform point-wise chemistry computation (2)
                                          2011 HPCMP User Group © Cray Inc.          June 20, 2011
2011 HPCMP User Group © Cray Inc.   June 20, 2011   204
 Creates a good-granularity OpenMP loop
 Improves cache reuse
 Reduces memory usage significantly
 Creates a good potential kernel for an accelerator
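A minimal sketch (not S3D itself; routine and array names are invented) of what hoisting the grid loop and adding high-level OpenMP looks like: the loop over grid points moves up the call tree and becomes a single parallel loop, and everything inside it is point-wise.

! Hedged sketch of a hoisted, high-level OpenMP grid loop; names are placeholders.
program hoisted_omp
  implicit none
  integer, parameter :: npoints = 100000, nvar = 10
  real(8) :: q(nvar, npoints), rhs(nvar, npoints)
  integer :: i

  q = 1.0d0

!$omp parallel do private(i)
  do i = 1, npoints
     call point_kernel(q(:, i), rhs(:, i))   ! all work inside the loop is point-wise
  end do
!$omp end parallel do

contains
  subroutine point_kernel(qp, rp)
    real(8), intent(in)  :: qp(:)
    real(8), intent(out) :: rp(:)
    rp = 2.0d0 * qp                          ! placeholder for primary variables + chemistry
  end subroutine point_kernel
end program hoisted_omp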




                               2011 HPCMP User Group © Cray Inc.               205
                                                               June 20, 2011
CPU Optimizations
Optimizing Communication
    I/O Best Practices




   2011 HPCMP User Group © Cray Inc.   June 20, 2011   206
2011 HPCMP User Group © Cray Inc.   June 20, 2011   207
55. 1                    ii = 0
56. 1 2-----------<      do b = abmin, abmax
57. 1 2 3---------<        do j = ijmin, ijmax
58. 1 2 3                    ii = ii+1
59. 1 2 3                    jj = 0
60. 1 2 3 4-------<          do a = abmin, abmax
61. 1 2 3 4 r8----<            do i = ijmin, ijmax
62. 1 2 3 4 r8                   jj = jj+1
63. 1 2 3 4 r8                   f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
64. 1 2 3 4 r8                   f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
65. 1 2 3 4 r8                   f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
66. 1 2 3 4 r8                   f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
67. 1 2 3 4 r8---->            end do
68. 1 2 3 4------->          end do
69. 1 2 3--------->        end do
70. 1 2----------->      end do

Poor loop order results in poor striding:
 The inner-most loop strides on a slow dimension of each array.
 The best the compiler can do is unroll.
 Little to no cache reuse.
                               2011 HPCMP User Group © Cray Inc.   June 20, 2011                         208
USER / #1.Original Loops
-----------------------------------------------------------------
 Time%                                                     55.0%
 Time                                                 13.938244 secs
 Imb.Time                                              0.075369 secs
 Imb.Time%                                                   0.6%
 Calls                              0.1 /sec                 1.0 calls
 DATA_CACHE_REFILLS:
   L2_MODIFIED:L2_OWNED:
   L2_EXCLUSIVE:L2_SHARED       11.858M/sec           165279602 fills
 DATA_CACHE_REFILLS_FROM_SYSTEM:
   ALL                          11.931M/sec           166291054 fills
 PAPI_L1_DCM                    23.499M/sec           327533338 misses
 PAPI_L1_DCA                    34.635M/sec           482751044 refs
 User time (approx)             13.938 secs        36239439807 cycles
  100.0%Time
 Average Time per Call                                13.938244 sec
 CrayPat Overhead : Time           0.0%
 D1 cache hit,miss ratios          32.2% hits              67.8% misses
 D2 cache hit,miss ratio           49.8% hits              50.2% misses
 D1+D2 cache hit,miss ratio        66.0% hits              34.0% misses

Poor loop order results in poor cache reuse:
 For every L1 cache hit there are roughly two misses.
 Overall, only 2/3 of all references were satisfied from level 1 or level 2 cache.




                               2011 HPCMP User Group © Cray Inc.    June 20, 2011                           209
2011 HPCMP User Group © Cray Inc.   June 20, 2011   210
2011 HPCMP User Group © Cray Inc.   June 20, 2011   211
75. 1 2-----------<      do i = ijmin, ijmax
76. 1 2                    jj = 0
77. 1 2 3---------<        do a = abmin, abmax
78. 1 2 3 4-------<          do j = ijmin, ijmax
79. 1 2 3 4                    jj = jj+1
80. 1 2 3 4                    ii = 0
81. 1 2 3 4 Vcr2--<            do b = abmin, abmax
82. 1 2 3 4 Vcr2                 ii = ii+1
83. 1 2 3 4 Vcr2                 f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
84. 1 2 3 4 Vcr2                 f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
85. 1 2 3 4 Vcr2                 f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
86. 1 2 3 4 Vcr2                 f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
87. 1 2 3 4 Vcr2-->            end do
88. 1 2 3 4------->          end do
89. 1 2 3--------->        end do
90. 1 2----------->      end do

Reordered loop nest:
 The inner-most loop is now stride-1 on both arrays.
 Memory accesses run along cache lines, allowing reuse.
 The compiler is able to vectorize and make better use of SSE instructions.


                              2011 HPCMP User Group © Cray Inc.   June 20, 2011                           212
USER / #2.Reordered Loops
-----------------------------------------------------------------
 Time%                                                     31.4%
 Time                                                  7.955379 secs
 Imb.Time                                              0.260492 secs
 Imb.Time%                                                   3.8%
 Calls                               0.1 /sec                 1.0 calls
 DATA_CACHE_REFILLS:
   L2_MODIFIED:L2_OWNED:
   L2_EXCLUSIVE:L2_SHARED          0.419M/sec             3331289 fills
 DATA_CACHE_REFILLS_FROM_SYSTEM:
   ALL                          15.285M/sec           121598284 fills
 PAPI_L1_DCM                    13.330M/sec           106046801 misses
 PAPI_L1_DCA                    66.226M/sec           526855581 refs
 User time (approx)             7.955 secs         20684020425 cycles
  100.0%Time
 Average Time per Call                                 7.955379 sec
 CrayPat Overhead : Time            0.0%
 D1 cache hit,miss ratios          79.9% hits              20.1% misses
 D2 cache hit,miss ratio            2.7% hits              97.3% misses
 D1+D2 cache hit,miss ratio        80.4% hits              19.6% misses

Improved striding greatly improved cache reuse:
 Runtime was cut nearly in half.
 Still, some 20% of all references are cache misses.




                               2011 HPCMP User Group © Cray Inc.    June 20, 2011                            213
First loop, partially vectorized and unrolled by 4:

95.    1                    ii = 0
96.    1 2-----------<      do j = ijmin, ijmax
97.    1 2 i---------<        do b = abmin, abmax
98.    1 2 i                    ii = ii+1
99.    1 2 i                    jj = 0
100.   1 2 i i-------<          do i = ijmin, ijmax
101.   1 2 i i Vpr4--<            do a = abmin, abmax
102.   1 2 i i Vpr4                 jj = jj+1
103.   1 2 i i Vpr4                 f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
104.   1 2 i i Vpr4                 f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
105.   1 2 i i Vpr4-->            end do
106.   1 2 i i------->          end do
107.   1 2 i--------->        end do
108.   1 2----------->      end do

Second loop, vectorized and unrolled by 4:

109.   1                    jj = 0
110.   1 2-----------<      do i = ijmin, ijmax
111.   1 2 3---------<        do a = abmin, abmax
112.   1 2 3                    jj = jj+1
113.   1 2 3                    ii = 0
114.   1 2 3 4-------<          do j = ijmin, ijmax
115.   1 2 3 4 Vr4---<            do b = abmin, abmax
116.   1 2 3 4 Vr4                  ii = ii+1
117.   1 2 3 4 Vr4                  f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
118.   1 2 3 4 Vr4                  f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
119.   1 2 3 4 Vr4--->            end do
120.   1 2 3 4------->          end do
121.   1 2 3--------->        end do
122.   1 2----------->      end do




                                        2011 HPCMP User Group © Cray Inc.   June 20, 2011                          214
USER / #3.Fissioned Loops
-----------------------------------------------------------------
 Time%                                                       9.8%
 Time                                                  2.481636 secs
 Imb.Time                                              0.045475 secs
 Imb.Time%                                                   2.1%
 Calls                               0.4 /sec                 1.0 calls
 DATA_CACHE_REFILLS:
   L2_MODIFIED:L2_OWNED:
   L2_EXCLUSIVE:L2_SHARED          1.175M/sec             2916610 fills
 DATA_CACHE_REFILLS_FROM_SYSTEM:
   ALL                          34.109M/sec            84646518 fills
 PAPI_L1_DCM                    26.424M/sec            65575972 misses
 PAPI_L1_DCA                  156.705M/sec            388885686 refs
 User time (approx)             2.482 secs          6452279320 cycles
  100.0%Time
 Average Time per Call                                 2.481636 sec
 CrayPat Overhead : Time            0.0%
 D1 cache hit,miss ratios          83.1% hits              16.9% misses
 D2 cache hit,miss ratio            3.3% hits              96.7% misses
 D1+D2 cache hit,miss ratio        83.7% hits              16.3% misses

Fissioning further improved cache reuse and resulted in better vectorization:
 Runtime was further reduced.
 The cache hit/miss ratio improved slightly.
 The loopmark file points to better vectorization of the fissioned loops.




                               2011 HPCMP User Group © Cray Inc.    June 20, 2011                          215
2011 HPCMP User Group © Cray Inc.   June 20, 2011   216
Triple-nested loop at a high level; IFs inside the inner loop can significantly reduce the chances of vectorization.

(    52) C         THE ORIGINAL
(    53)
(    54)          DO 47020   J = 1, JMAX
(    55)           DO 47020 K = 1, KMAX
(    56)            DO 47020 I = 1, IMAX
(    57)           JP              =   J   +   1
(    58)           JR              =   J   -   1
(    59)           KP              =   K   +   1
(    60)           KR              =   K   -   1
(    61)           IP              =   I   +   1
(    62)           IR              =   I   -   1
(    63)                IF (J .EQ. 1)     GO TO 50
(    64)                 IF( J .EQ. JMAX) GO TO 51
(    65)             XJ = ( A(I,JP,K) -            A(I,JR,K) ) * DA2
(    66)             YJ = ( B(I,JP,K) -            B(I,JR,K) ) * DA2
(    67)             ZJ = ( C(I,JP,K) -            C(I,JR,K) ) * DA2
(    68)             GO TO 70
(    69)      50   J1 = J + 1
(    70)           J2 = J + 2
(    71)           XJ = (-3. * A(I,J,K)            + 4. * A(I,J1,K) - A(I,J2,K) ) * DA2
(    72)           YJ = (-3. * B(I,J,K)            + 4. * B(I,J1,K) - B(I,J2,K) ) * DA2
(    73)           ZJ = (-3. * C(I,J,K)            + 4. * C(I,J1,K) - C(I,J2,K) ) * DA2
(    74)           GO TO 70
(    75)      51   J1 = J - 1
(    76)           J2 = J - 2
(    77)           XJ = ( 3. * A(I,J,K)            - 4. * A(I,J1,K) + A(I,J2,K) ) * DA2
(    78)           YJ = ( 3. * B(I,J,K)            - 4. * B(I,J1,K) + B(I,J2,K) ) * DA2
(    79)           ZJ = ( 3. * C(I,J,K)            - 4. * C(I,J1,K) + C(I,J2,K) ) * DA2
(    80)      70   CONTINUE
(    81)                IF (K .EQ. 1)     GO TO 52
(    82)                 IF (K .EQ. KMAX) GO TO 53
(    83)                XK   = ( A(I,J,KP) - A(I,J,KR) ) * DB2
(    84)                YK   = ( B(I,J,KP) - B(I,J,KR) ) * DB2
(    85)                ZK   = ( C(I,J,KP) - C(I,J,KR) ) * DB2
(    86)                GO   TO 71
    continues…                                           2011 HPCMP User Group © Cray Inc.   June 20, 2011               217
PGI
55, Invariant if transformation
     Loop not vectorized: loop count too small
  56, Invariant if transformation




                      2011 HPCMP User Group © Cray Inc.   June 20, 2011   218
Stride-1 I loops are brought inside the IF statements:

(   141) C       THE RESTRUCTURED
(   142)
(   143)          DO 47029 J = 1, JMAX
(   144)           DO 47029 K = 1, KMAX
(   145)
(   146)               IF(J.EQ.1)THEN
(   147)
(   148)          J1         = 2
(   149)          J2         = 3
(   150)               DO 47021 I = 1, IMAX
(   151)           VAJ(I) = (-3. * A(I,J,K) + 4. * A(I,J1,K) - A(I,J2,K) ) * DA2
(   152)           VBJ(I) = (-3. * B(I,J,K) + 4. * B(I,J1,K) - B(I,J2,K) ) * DA2
(   153)           VCJ(I) = (-3. * C(I,J,K) + 4. * C(I,J1,K) - C(I,J2,K) ) * DA2
(   154) 47021    CONTINUE
(   155)
(   156)               ELSE IF(J.NE.JMAX) THEN
(   157)
(   158)          JP         = J+1
(   159)          JR         = J-1
(   160)               DO 47022 I = 1, IMAX
(   161)           VAJ(I) = ( A(I,JP,K) - A(I,JR,K) ) * DA2
(   162)           VBJ(I) = ( B(I,JP,K) - B(I,JR,K) ) * DA2
(   163)           VCJ(I) = ( C(I,JP,K) - C(I,JR,K) ) * DA2
(   164) 47022    CONTINUE
(   165)
(   166)          ELSE
(   167)
(   168)          J1         = JMAX-1
(   169)          J2         = JMAX-2
(   170)          DO 47023 I = 1, IMAX
(   171)           VAJ(I) = ( 3. * A(I,J,K) - 4. * A(I,J1,K) + A(I,J2,K) ) * DA2
(   172)           VBJ(I) = ( 3. * B(I,J,K) - 4. * B(I,J1,K) + B(I,J2,K) ) * DA2
(   173)           VCJ(I) = ( 3. * C(I,J,K) - 4. * C(I,J1,K) + C(I,J2,K) ) * DA2
(   174) 47023    CONTINUE
(   175)
(   176)          ENDIF
Continues…                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011       219
PGI
 144, Invariant if transformation
    Loop not vectorized: loop count too small
 150, Generated 3 alternate loops for the inner loop
    Generated vector sse code for inner loop
    Generated 8 prefetch instructions for this loop
    Generated vector sse code for inner loop
    Generated 8 prefetch instructions for this loop
    Generated vector sse code for inner loop
    Generated 8 prefetch instructions for this loop
    Generated vector sse code for inner loop
    Generated 8 prefetch instructions for this loop
 160, Generated 4 alternate loops for the inner loop
    Generated vector sse code for inner loop
    Generated 6 prefetch instructions for this loop
    Generated vector sse code for inner loop
 ooo

                          2011 HPCMP User Group © Cray Inc.   June 20, 2011   220
[Figure: MFLOPS (0–2500) versus vector length (0–500) for CCE-Original Fortran, CCE-Restructured Fortran, PGI-Original Fortran, and PGI-Restructured Fortran]
                             2011 HPCMP User Group © Cray Inc.    June 20, 2011                221
 Maximum vector length doubled to 256 bits
 Much cleaner instruction set
   The result register is distinct from the source registers
   The old SSE instruction set always destroyed a source register

 Floating-point multiply-accumulate
   A(1:4) = B(1:4)*C(1:4) + D(1:4) ! Now one instruction

 Next gen of both AMD and Intel will have AVX


 Vectors are becoming more important, not less

                         2011 HPCMP User Group © Cray Inc.   June 20, 2011   222
2011 HPCMP User Group © Cray Inc.   June 20, 2011   223
 Cache blocking is a combination of strip mining and loop interchange, designed
  to increase data reuse.
     Takes advantage of temporal reuse: re-reference array elements already
       referenced
     Good blocking will take advantage of spatial reuse: work with the cache
       lines!
 Many ways to block any given loop nest
     Which loops get blocked?
     What block size(s) to use?
 Analysis can reveal which ways are beneficial
 But trial-and-error is probably faster




                              2011 HPCMP User Group © Cray Inc.   June 20, 2011   224
2D Laplacian example on a 16 × 8 grid of u (i = 1..16, j = 1..8):

      do j = 1, 8
         do i = 1, 16
            a(i,j) = u(i-1,j) + u(i+1,j) &
                     - 4*u(i,j)          &
                     + u(i,j-1) + u(i,j+1)
         end do
      end do

Cache structure for this example:
       Each line holds 4 array elements
       Cache can hold 12 lines of u data

No cache reuse between outer loop iterations: 120 cache misses in total.


                         2011 HPCMP User Group © Cray Inc.   June 20, 2011   225
Unblocked loop: 120 cache misses.
Block the inner loop (i blocks of 4: i = 1–4, 5–8, 9–12, 13–16):

      do IBLOCK = 1, 16, 4
         do j = 1, 8
            do i = IBLOCK, IBLOCK + 3
               a(i,j) = u(i-1,j) + u(i+1,j) &
                        - 4*u(i,j)          &
                        + u(i,j-1) + u(i,j+1)
            end do
         end do
      end do

Now we have reuse of the “j+1” data: 80 cache misses in total.



                        2011 HPCMP User Group © Cray Inc.   June 20, 2011   226
One-dimensional blocking reduced misses from 120 to 80.
Iterate over 4 × 4 blocks:

      do JBLOCK = 1, 8, 4
         do IBLOCK = 1, 16, 4
            do j = JBLOCK, JBLOCK + 3
               do i = IBLOCK, IBLOCK + 3
                  a(i,j) = u(i-1,j) + u(i+1,j) &
                           - 4*u(i,j)          &
                           + u(i,j-1) + u(i,j+1)
               end do
            end do
         end do
      end do

Better use of spatial locality (cache lines).


                        2011 HPCMP User Group © Cray Inc.   June 20, 2011    227
   Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
   Operations can be arranged to create multiple levels of blocking
      Block for register
      Block for cache (L1, L2, L3)
      Block for TLB
   No further discussion here. Interested readers can see
      Any book on code optimization
             Sun’s Techniques for Optimizing Applications: High Performance Computing contains a decent introductory discussion in
              Chapter 8
             Insert your favorite book here
        Gunnels, Henry, and van de Geijn. June 2001. High-performance matrix multiplication
         algorithms for architectures with hierarchical memories. FLAME Working Note #4 TR-
         2001-22, The University of Texas at Austin, Department of Computer Sciences
             Develops algorithms and cost models for GEMM in hierarchical memories
        Goto and van de Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM
         Transactions on Mathematical Software 34, 3 (May), 1-25
             Description of GotoBLAS DGEMM
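A minimal sketch of just one level of cache blocking applied to GEMM (not the tuned algorithms referenced above; the block size nb is a placeholder you would tune so three nb × nb tiles fit in the target cache):

! Hedged sketch: one level of cache blocking for C = C + A*B on square N×N matrices.
! Real GEMMs (e.g. GotoBLAS) add register blocking and data packing on top of this.
subroutine blocked_gemm(n, nb, a, b, c)
  implicit none
  integer, intent(in)    :: n, nb
  real(8), intent(in)    :: a(n,n), b(n,n)
  real(8), intent(inout) :: c(n,n)
  integer :: ib, jb, kb, i, j, k

  do jb = 1, n, nb
     do kb = 1, n, nb
        do ib = 1, n, nb
           do j = jb, min(jb+nb-1, n)          ! work on one nb×nb tile of C
              do k = kb, min(kb+nb-1, n)
                 do i = ib, min(ib+nb-1, n)    ! stride-1 inner loop
                    c(i,j) = c(i,j) + a(i,k) * b(k,j)
                 end do
              end do
           end do
        end do
     end do
  end do
end subroutine blocked_gemm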




                                              2011 HPCMP User Group © Cray Inc.         June 20, 2011                                 228
“I tried cache-blocking my code, but it didn’t help”


 You’re doing it wrong.
    Your block size is too small (too much loop overhead).
    Your block size is too big (data is falling out of cache).
    You’re targeting the wrong cache level (?)
    You haven’t selected the correct subset of loops to block.
 The compiler is already blocking that loop.
 Prefetching is acting to minimize cache misses.
 Computational intensity within the loop nest is very large, making blocking less
  important.



                                2011 HPCMP User Group © Cray Inc.   June 20, 2011   229
 Multigrid PDE solver
 Class D, 64 MPI ranks
    Global grid is 1024 × 1024 × 1024
    Local grid is 258 × 258 × 258
 Two similar loop nests account for >50% of run time
 27-point 3D stencil
    There is good data reuse along the leading dimension, even without blocking

      do i3 = 2, 257
         do i2 = 2, 257
            do i1 = 2, 257
!              update u(i1,i2,i3)
!              using 27-point stencil
            end do
         end do
      end do

[Figure: 27-point stencil around (i1,i2,i3), with neighbours i1±1, i2±1, i3±1; cache lines run along the i1 (leading) direction]




                              2011 HPCMP User Group © Cray Inc.    June 20, 2011                   230
 Block the inner two loops
 Creates blocks extending along the i3 direction

   do I2BLOCK = 2, 257, BS2
      do I1BLOCK = 2, 257, BS1
         do i3 = 2, 257
            do i2 = I2BLOCK,                   &
                    min(I2BLOCK+BS2-1, 257)
               do i1 = I1BLOCK,                &
                       min(I1BLOCK+BS1-1, 257)
!                 update u(i1,i2,i3)
!                 using 27-point stencil
               end do
            end do
         end do
      end do
   end do

      Block size    Mop/s/process
      unblocked        531.50
      16 × 16          279.89
      22 × 22          321.26
      28 × 28          358.96
      34 × 34          385.33
      40 × 40          408.53
      46 × 46          443.94
      52 × 52          468.58
      58 × 58          470.32
      64 × 64          512.03
      70 × 70          506.92
                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011               Slide 231
 Block the outer two loops
 Preserves spatial locality along the i1 direction

   do I3BLOCK = 2, 257, BS3
      do I2BLOCK = 2, 257, BS2
         do i3 = I3BLOCK,                   &
                 min(I3BLOCK+BS3-1, 257)
            do i2 = I2BLOCK,                &
                    min(I2BLOCK+BS2-1, 257)
               do i1 = 2, 257
!                 update u(i1,i2,i3)
!                 using 27-point stencil
               end do
            end do
         end do
      end do
   end do

      Block size    Mop/s/process
      unblocked        531.50
      16 × 16          674.76
      22 × 22          680.16
      28 × 28          688.64
      34 × 34          683.84
      40 × 40          698.47
      46 × 46          689.14
      52 × 52          706.62
      58 × 58          692.57
      64 × 64          703.40
      70 × 70          693.87
                                    2011 HPCMP User Group © Cray Inc.   June 20, 2011               Slide 232
2011 HPCMP User Group © Cray Inc.   June 20, 2011   233
(   53) void mat_mul_daxpy(double *a, double *b, double *c, int rowa,
    int cola, int colb)
(   54) {
(   55)     int i, j, k;               /* loop counters */
(   56)     int rowc, colc, rowb;      /* sizes not passed as arguments */
(   57)     double con;                /* constant value */
(   58)
(   59)     rowb = cola;
(   60)     rowc = rowa;
(   61)     colc = colb;
(   62)
(   63)     for(i=0;i<rowc;i++) {
(   64)         for(k=0;k<cola;k++) {
(   65)             con = *(a + i*cola +k);
(   66)             for(j=0;j<colc;j++) {
(   67)                 *(c + i*colc + j) += con * *(b + k*colb + j);
(   68)             }
(   69)         }
(   70)     }
(   71) }

mat_mul_daxpy:
    66, Loop not vectorized: data dependency
          Loop not vectorized: data dependency
          Loop unrolled 4 times

C pointers don’t carry the same rules as Fortran arrays:
 The compiler has no way to know whether *a, *b, and *c overlap or are referenced differently elsewhere.
 It must assume the worst, resulting in a false data dependency.
                                          2011 HPCMP User Group © Cray Inc.   June 20, 2011                       Slide 234
(   53) void mat_mul_daxpy(double* restrict a, double* restrict b,
    double* restrict c, int rowa, int cola, int colb)
(   54) {
(   55)     int i, j, k;               /* loop counters */
(   56)     int rowc, colc, rowb;      /* sizes not passed as arguments */
(   57)     double con;                /* constant value */
(   58)
(   59)     rowb = cola;
(   60)     rowc = rowa;
(   61)     colc = colb;
(   62)
(   63)     for(i=0;i<rowc;i++) {
(   64)         for(k=0;k<cola;k++) {
(   65)             con = *(a + i*cola +k);
(   66)             for(j=0;j<colc;j++) {
(   67)                 *(c + i*colc + j) += con * *(b + k*colb + j);
(   68)             }
(   69)         }
(   70)     }
(   71) }

C pointers, restricted:
 C99 introduces the restrict keyword, which lets the programmer promise not to reference the memory through another pointer.
 If you declare a restricted pointer and break that promise, the behavior is undefined by the standard.



                                          2011 HPCMP User Group © Cray Inc.   June 20, 2011                      Slide 235
66, Generated alternate loop with no peeling - executed if loop count <= 24
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
       Generated alternate loop with no peeling and more aligned moves -
  executed if loop count <= 24 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
       Generated alternate loop with more aligned moves - executed if loop
  count >= 25 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop


• This can also be achieved with the PGI safe pragma or –Msafeptr
  compiler option or Pathscale –OPT:alias option

                           2011 HPCMP User Group © Cray Inc.   June 20, 2011   Slide 236
2011 HPCMP User Group © Cray Inc.   June 20, 2011   Slide 237
 GNU malloc library
   malloc, calloc, realloc, free calls
      Fortran dynamic variables
 Malloc library system calls
   mmap, munmap => used for larger allocations
   brk, sbrk => grow/shrink the heap
 The malloc library is optimized for low system memory use
   This can result in extra system calls and minor page faults




                       2011 HPCMP User Group © Cray Inc.   June 20, 2011   238
 Detecting “bad” malloc behavior
   Profile data => “excessive system time”
 Correcting “bad” malloc behavior
   Eliminate mmap use by malloc
   Increase threshold to release heap memory
 Use environment variables to alter malloc
   MALLOC_MMAP_MAX_ = 0
   MALLOC_TRIM_THRESHOLD_ = 536870912
 Possible downsides
    Heap fragmentation
    User process may call mmap directly
    User process may launch other processes
 PGI’s –Msmartalloc does something similar for you at compile time
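For example, a minimal sketch for a batch/run script (bash syntax; the values are the ones suggested above):

export MALLOC_MMAP_MAX_=0                 # never satisfy malloc with mmap
export MALLOC_TRIM_THRESHOLD_=536870912   # only release heap back to the OS above 512 MB of free space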




                         2011 HPCMP User Group © Cray Inc.   June 20, 2011   239
 Google created a replacement “malloc” library
   “Minimal” TCMalloc replaces GNU malloc
 Limited testing indicates TCMalloc as good or better
  than GNU malloc
    Environment variables not required
    TCMalloc almost certainly better for allocations in
     OpenMP parallel regions
 There’s currently no pre-built tcmalloc for Cray XT/XE,
  but some users have successfully built it.



                     2011 HPCMP User Group © Cray Inc.   June 20, 2011   240
 Linux has a “first touch policy” for memory allocation
    *alloc functions don’t actually allocate your memory
    Memory gets allocated when “touched”
 Problem: a code can allocate more memory than is available
    Linux assumes swap space is available; Cray compute nodes have none
    Applications won’t fail from over-allocation until the memory is actually
     touched
 Problem: memory is placed on the NUMA node of the “touching” thread
    Only a problem if thread 0 allocates all memory for a node
 Solution: Always initialize your memory immediately after allocating it
    If you over-allocate, it will fail immediately rather than at a strange place
     in your code
    If every thread touches its own memory, it will be allocated on the
     proper socket
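A minimal sketch (array name and size are placeholders) of initializing memory in parallel right after allocating it, so each thread first-touches, and therefore places, its own pages:

! Hedged sketch of OpenMP first-touch initialization.
program first_touch
  implicit none
  integer, parameter :: n = 10000000
  real(8), allocatable :: a(:)
  integer :: i

  allocate(a(n))        ! no physical pages are assigned yet

!$omp parallel do private(i)
  do i = 1, n
     a(i) = 0.0d0       ! each thread touches (and therefore places) its own pages
  end do
!$omp end parallel do
end program first_touch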


                             2011 HPCMP User Group © Cray Inc.   June 20, 2011     Slide 241
This may help both compute and communication.




             2011 HPCMP User Group © Cray Inc.   June 20, 2011   242
 Opterons support 4K, 2M, and 1G pages
   We don’t support 1G pages
   4K pages are used by default
 2M pages are more difficult to use, but…


 Your code may run with fewer TLB misses (hence faster).
   The TLB can address more physical memory with 2M pages than
     with 4K pages
 The Gemini interconnect performs better with 2M pages than with 4K pages.
   2M pages consume fewer Gemini resources than 4K pages.




                       2011 HPCMP User Group © Cray Inc.   June 20, 2011   243
 Link the hugetlbfs library into your code: ‘-lhugetlbfs’
 Set the HUGETLB_MORECORE environment variable in your run script.
    Example: export HUGETLB_MORECORE=yes
 Use the aprun option -m###h to request ### MB of huge pages.
    Example: aprun -m500h  (request 500 MB of huge pages as available;
     use 4K pages thereafter)
    Example: aprun -m500hs (request 500 MB of huge pages; if not
     available, terminate the launch)
 Note: If not enough HUGE pages are available, the cost of
  filling the remaining with 4K pages may degrade performance.


                        2011 HPCMP User Group © Cray Inc.   June 20, 2011   244
2011 HPCMP User Group © Cray Inc.   June 20, 2011   245
 Short Message Eager Protocol

    The sending rank “pushes” the message to the receiving rank
    Used for messages MPICH_MAX_SHORT_MSG_SIZE bytes or less
    Sender assumes that receiver can handle the message
       Matching receive is posted - or -
       Has available event queue entries (MPICH_PTL_UNEX_EVENTS) and
         buffer space (MPICH_UNEX_BUFFER_SIZE) to store the message

 Long Message Rendezvous Protocol

    Messages are “pulled” by the receiving rank
    Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
    Sender sends small header packet with information for the receiver to pull
     over the data
    Data is sent only after matching receive is posted by receiving rank
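A minimal sketch (not from the slides; the two-rank pattern, buffer size, and tag are invented) of pre-posting the receive so an incoming message finds a matching receive instead of landing in the unexpected buffers:

! Hedged sketch: post the receive before the matching send arrives.
program prepost_recv
  use mpi
  implicit none
  integer, parameter :: n = 100000
  real(8) :: sendbuf(n), recvbuf(n)
  integer :: ierr, rank, req, other
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  other = 1 - rank                      ! assumes exactly 2 ranks
  sendbuf = real(rank, 8)

  call MPI_Irecv(recvbuf, n, MPI_REAL8, other, 0, MPI_COMM_WORLD, req, ierr)   ! post the receive first
  call MPI_Send (sendbuf, n, MPI_REAL8, other, 0, MPI_COMM_WORLD, ierr)
  call MPI_Wait (req, status, ierr)

  call MPI_Finalize(ierr)
end program prepost_recv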




                              2011 HPCMP User Group © Cray Inc.   June 20, 2011   246
Message path when the MPI_RECV is posted prior to the MPI_SEND call:
 STEP 1: the receiver (rank 1) calls MPI_RECV and posts a match entry (ME) to Portals on the SeaStar.
 STEP 2: the sender (rank 0) calls MPI_SEND.
 STEP 3: a Portals DMA PUT delivers the data directly into the posted application buffer.

[Figure: besides the application ME, MPI posts match entries to handle unexpected messages – an eager short-message ME and a rendezvous long-message ME – backed by the MPI unexpected buffers (MPICH_UNEX_BUFFER_SIZE), the unexpected message queue, the other event queue (MPICH_PTL_OTHER_EVENTS), and the unexpected event queue (MPICH_PTL_UNEX_EVENTS).]
                            2011 HPCMP User Group © Cray Inc.    June 20, 2011                             247
MPT eager protocol on SeaStar: data is “pushed” to the receiver
(messages of MPICH_MAX_SHORT_MSG_SIZE bytes or less); MPI_RECV is not posted prior to the MPI_SEND call.
 STEP 1: the sender (rank 0) calls MPI_SEND.
 STEP 2: a Portals DMA PUT delivers the data into the MPI unexpected buffers (MPICH_UNEX_BUFFER_SIZE), with entries recorded in the unexpected message queue and unexpected event queue (MPICH_PTL_UNEX_EVENTS).
 STEP 3: the receiver (rank 1) calls MPI_RECV; no Portals ME is needed.
 STEP 4: the data is copied (memcpy) from the unexpected buffer into the application buffer.
                                     2011 HPCMP User Group © Cray Inc.   June 20, 2011                           248
MPT rendezvous protocol on SeaStar: data is not sent until the MPI_RECV is issued.
 STEP 1: the sender (rank 0) calls MPI_SEND and a Portals ME is created for the send buffer (App ME).
 STEP 2: a Portals DMA PUT sends a small header packet to the receiver.
 STEP 3: the receiver (rank 1) calls MPI_RECV, which triggers the GET request.
 STEP 4: the receiver issues a GET request that matches the sender’s ME.
 STEP 5: a Portals DMA transfers the data into the receive buffer.
                                 2011 HPCMP User Group © Cray Inc.       June 20, 2011                           249
2011 HPCMP User Group © Cray Inc.   June 20, 2011   250
 The default ordering can be changed using the following
  environment variable:
       MPICH_RANK_REORDER_METHOD
 These are the different values that you can set it to:
       0: Round-robin placement – Sequential ranks are placed on the next node in the
          list. Placement starts over with the first node upon reaching the end of the list.
       1: SMP-style placement – Sequential ranks fill up each node before moving to the
          next.
       2: Folded rank placement – Similar to round-robin placement except that each pass
          over the node list is in the opposite direction of the previous pass.
       3: Custom ordering. The ordering is specified in a file named
          MPICH_RANK_ORDER.
 When is this useful?
    Point-to-point communication consumes a significant fraction of program time and a
     load imbalance is detected
    Also shown to help for collectives (alltoall) on sub-communicators (GYRO)
    To spread I/O across nodes (POP)

                                2011 HPCMP User Group © Cray Inc.   June 20, 2011        251
 One can also use the CrayPat performance measurement tools to generate a
  suggested custom ordering.
    Available if MPI functions are traced (-g mpi or -O apa)
    pat_build -O apa my_program
       see Examples section of pat_build man page
 pat_report options:
    mpi_sm_rank_order
       Uses message data from tracing MPI to generate suggested MPI rank
        order. Requires the program to be instrumented using the pat_build -g
        mpi option.
    mpi_rank_order
       Uses time in user functions, or alternatively, any other metric specified
        by using the -s mro_metric options, to generate suggested MPI rank
        order.

                              2011 HPCMP User Group © Cray Inc.   June 20, 2011     252
 module load xt-craypat
 Rebuild your code
 pat_build -O apa a.out
 Run a.out+pat
 pat_report -O mpi_sm_rank_order a.out+pat+…sdt/ > pat.report
 This creates an MPICH_RANK_ORDER.x file
 Then set the env var MPICH_RANK_REORDER_METHOD=3 and
  link the file MPICH_RANK_ORDER.x to MPICH_RANK_ORDER
 Rerun the code




                           2011 HPCMP User Group © Cray Inc.   June 20, 2011   253
Table 1:   Suggested MPI Rank Order

Eight cores per node: USER Samp per node

 Rank    Max          Max/     Avg          Avg/      Max Node
 Order   USER Samp    SMP      USER Samp    SMP       Ranks
    d        17062    97.6%        16907    100.0%    832,328,820,797,113,478,898,600
    2        17213    98.4%        16907    100.0%    53,202,309,458,565,714,821,970
    0        17282    98.8%        16907    100.0%    53,181,309,437,565,693,821,949
    1        17489   100.0%        16907    100.0%    0,1,2,3,4,5,6,7

This suggests that:
 1. The custom ordering “d” might be the best
 2. Folded-rank is next best
 3. Round-robin is third best
 4. The default ordering is last




                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011          254
 GYRO 8.0
    B3-GTC problem with 1024 processes
 Run with alternate MPI orderings
    Custom: profiled with -O apa and used reordering file MPICH_RANK_ORDER.d

      Reorder method          Comm. time
      Default                   11.26 s
      0 – round-robin            6.94 s
      2 – folded-rank            6.68 s
      d – custom from apa        8.03 s

 The CrayPat suggestion was almost right!

                             2011 HPCMP User Group © Cray Inc.   June 20, 2011                   255
 TGYRO 1.0
    Steady-state turbulent transport code using GYRO, NEO, TGLF components
 ASTRA test case
    Tested MPI orderings at large scale
    Originally testing weak scaling, but found reordering very useful

      Reorder method     TGYRO wall time (min)
                         20480    40960    81920
      Default              99m     104m     105m
      Round-robin          66m      63m      72m

 A huge win!

                              2011 HPCMP User Group © Cray Inc.   June 20, 2011          256
2011 HPCMP User Group © Cray Inc.   June 20, 2011   257
Time % |        Time |   Imb. Time |   Imb.   |         Calls |Experiment=1
       |             |             | Time %   |               |Group
       |             |             |          |               | Function
       |             |             |          |               | PE='HIDE'

 100.0% | 1530.892958 |         -- |     -- | 27414118.0 |Total
|---------------------------------------------------------------------
| 52.0% | 796.046937 |                -- |      -- | 22403802.0 |USER
||--------------------------------------------------------------------
|| 22.3% | 341.176468 |     3.482338 |   1.0% | 19200000.0 |getrates_
|| 17.4% | 266.542501 | 35.451437 | 11.7% |         1200.0 |rhsf_
||   5.1% |   78.772615 |   0.532703 |   0.7% | 3200000.0 |mcavis_new_looptool_
||   2.6% |   40.477488 |   2.889609 |   6.7% |     1200.0 |diffflux_proc_looptool_
||   2.1% |   31.666938 |   6.785575 | 17.6% |       200.0 |integrate_erk_jstage_lt_
||   1.4% |   21.318895 |   5.042270 | 19.1% |      1200.0 |computeheatflux_looptool_
||   1.1% |   16.091956 |   6.863891 | 29.9% |         1.0 |main
||====================================================================
| 47.4% | 725.049709 |                -- |      -- | 5006632.0 |MPI
||--------------------------------------------------------------------
|| 43.8% | 670.742304 | 83.143600 | 11.0% | 2389440.0 |mpi_wait_
||   1.9% |   28.821882 | 281.694997 | 90.7% | 1284320.0 |mpi_isend_
|=====================================================================




                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011     258
Time % |        Time |   Imb. Time |   Imb.   |         Calls |Experiment=1
       |             |             | Time %   |               |Group
       |             |             |          |               | Function
       |             |             |          |               | PE='HIDE'

 100.0% | 1730.555208 |         -- |     -- | 16090113.8 |Total
|---------------------------------------------------------------------
|   76.9% | 1330.111350 |             -- |           -- |     4882627.8 |MPI
||--------------------------------------------------------------------
|| 72.1% | 1247.436960 | 54.277263 |     4.2% | 2389440.0 |mpi_wait_
||   1.3% |   22.712017 | 101.212360 | 81.7% | 1234718.3 |mpi_isend_
||   1.0% |   17.623757 |   4.642004 | 20.9% |         1.0 |mpi_comm_dup_
||   1.0% |   16.849281 | 71.805979 | 81.0% | 1234718.3 |mpi_irecv_
||   1.0% |   16.835691 | 192.820387 | 92.0% |     19999.2 |mpi_waitall_
||====================================================================
| 22.2% | 384.978417 |           -- |     -- | 11203802.0 |USER
||--------------------------------------------------------------------
||   9.9% | 171.440025 |    1.929439 |   1.1% | 9600000.0 |getrates_
||   7.7% | 133.599580 | 19.572807 | 12.8% |        1200.0 |rhsf_
||   2.3% |   39.465572 |   0.600168 |   1.5% | 1600000.0 |mcavis_new_looptool_
|=====================================================================
|=====================================================================




                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011   259
Nearest-neighbour communication partners for the derivative computations:
 Differencing in the X direction: MPI tasks K-1, K, K+1
 Differencing in the Y direction: MPI tasks K-30, K, K+30
 Differencing in the Z direction: MPI tasks K-1200, K, K+1200
                               2011 HPCMP User Group © Cray Inc.   June 20, 2011                 260
Code must perform one communication across each surface of a cube
12 cubes perform 72 communications, 63 of which go “off node”


           Optimized mapping of the MPI tasks on the node
           Still performs 72 communications, but now only 32 are off node




                            2011 HPCMP User Group © Cray Inc.   June 20, 2011   261
Rank Reordering Case Study

Application data is in a 3D space, X × Y × Z.

                                                    Communication is
                                                    nearest-neighbor.

                                                    Default ordering
                                                    results in 12x1x1
                                                    block on each node.

                                                    A custom reordering
                                                    is now generated:
                                                    3x2x2 blocks per
                                                    node, resulting in
                                                    more on-node
                                                    communication




2011 HPCMP User Group © Cray Inc.   June 20, 2011                            262
% pat_report -O mpi_sm_rank_order -s rank_grid_dim=8,6 ...

Notes for table 1:

    To maximize the locality of point to point communication,
    specify a Rank Order with small Max and Avg Sent Msg Total Bytes
    per node for the target number of cores per node.

    To specify a Rank Order with a numerical value, set the environment
    variable MPICH_RANK_REORDER_METHOD to the given value.

    To specify a Rank Order with a letter value 'x', set the environment
    variable MPICH_RANK_REORDER_METHOD to 3, and copy or link the file
    MPICH_RANK_ORDER.x to MPICH_RANK_ORDER.

Table 1:   Sent Message Stats and Suggested MPI Rank Order

                 Communication Partner Counts

        Number    Rank
      Partners   Count      Ranks

             2        4     0   5    42     47
             3       20     1   2     3      4    ...
             4       24     7   8     9     10    ...
                                2011 HPCMP User Group © Cray Inc.   June 20, 2011   Slide 263
Four cores per node:       Sent Msg Total Bytes per node

  Rank          Max    Max/           Avg    Avg/   Max Node
 Order  Total Bytes     SMP   Total Bytes     SMP   Ranks

     g    121651200   73.9%     86400000   62.5%   14,20,15,21
     h    121651200   73.9%     86400000   62.5%   14,20,21,15
     u    152064000   92.4%    146534400  106.0%   13,12,10,4
     1    164505600  100.0%    138240000  100.0%   16,17,18,19
     d    164505600  100.0%    142387200  103.0%   16,17,19,18
     0    224640000  136.6%    207360000  150.0%   1,13,25,37
     2    241920000  147.1%    207360000  150.0%   7,16,31,40




                          2011 HPCMP User Group © Cray Inc.    June 20, 2011                 Slide 264
% $CRAYPAT_ROOT/sbin/grid_order -c 2,2 -g 8,6

# grid_order -c 2,2 -g 8,6
# Region 0: 0,0 (0..47)
0,1,6,7
2,3,8,9
4,5,10,11
12,13,18,19
14,15,20,21
16,17,22,23
24,25,30,31
26,27,32,33
28,29,34,35
36,37,42,43
38,39,44,45
40,41,46,47

This script will also handle the case that cells do not
evenly partition the grid.

                    2011 HPCMP User Group © Cray Inc.   June 20, 2011   Slide 265
 X   X   o    o
 X   X   o    o
 o   o   o    o
 o   o   o    o



 Nodes marked X heavily use a shared resource
 If that resource is memory bandwidth, scatter the X's
 If it is network bandwidth to other nodes, again scatter
 If it is network bandwidth among the X's themselves, concentrate them




                              2011 HPCMP User Group © Cray Inc.   June 20, 2011   Slide 267
2011 HPCMP User Group © Cray Inc.   June 20, 2011   268
Call mpi_send(a, 10, …)
Call mpi_send(b, 10, …)         Each message incurs latency and library overhead
Call mpi_send(c, 10, …)
Call mpi_send(d, 10, …)


 Copy messages into a contiguous buffer and send once

Sendbuf(1:10) = a(1:10)
Sendbuf(11:20) = b(1:10)
Sendbuf(21:30) = c(1:10)
Sendbuf(31:40) = d(1:10)

Call mpi_send(sendbuf, 40, …)     Latency and library overhead
                                  incurred only once

 Effectiveness of this optimization is machine dependent
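
For illustration, a minimal C sketch of the same aggregation idea (the array names, counts, and tag are illustrative, not taken from the Fortran fragment above):

#include <mpi.h>
#include <string.h>

/* Pack four small 10-element arrays into one contiguous buffer and send it
   once, so the latency and library overhead are paid only once. */
void send_aggregated(const double *a, const double *b,
                     const double *c, const double *d,
                     int dest, int tag, MPI_Comm comm)
{
    double sendbuf[40];

    memcpy(&sendbuf[0],  a, 10 * sizeof(double));
    memcpy(&sendbuf[10], b, 10 * sizeof(double));
    memcpy(&sendbuf[20], c, 10 * sizeof(double));
    memcpy(&sendbuf[30], d, 10 * sizeof(double));

    MPI_Send(sendbuf, 40, MPI_DOUBLE, dest, tag, comm);
}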

                                2011 HPCMP User Group © Cray Inc.   June 20, 2011   269
 Most collectives have been tuned to take advantage of
  algorithms and hardware to maximize performance
    MPI_ALLTOALL
       Reorder communications to spread traffic around the network efficiently
   MPI_BCAST/_REDUCE/_ALLREDUCE
       Use tree-based algorithms to reduce the number of messages.
       Needs to strike a balance between width and depth of the tree.
   MPI_GATHER
       Use a tree algorithm to reduce resource contention and aggregate messages.



 You don’t want to have to reinvent the wheel
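
As a minimal illustration (a sketch, not from the slides): a global sum is one call to the tuned collective, with no need to hand-code a reduction tree.

#include <mpi.h>

/* Global sum of a local value across all ranks; every rank gets the result.
   The library chooses the tree shape and algorithm for the machine. */
double global_sum(double local, MPI_Comm comm)
{
    double total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
    return total;
}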




                               2011 HPCMP User Group © Cray Inc.   June 20, 2011   270
 MPI_ALLTOALL
    Message size decreases as number of ranks grows
    Number of messages is O(num_ranks²)
    Very difficult to scale to very high core counts
 MPI_BCAST/_REDUCE/_ALLREDUCE/_BARRIER
    All are O(log (num_ranks))
    All represent global sync points
    Expose ANY load imbalance in the code
    Expose ANY “jitter” induced by the OS or other services
 MPI_GATHER
    Many-to-one


 The greater the frequency of collectives, the harder it will be to scale
                           2011 HPCMP User Group © Cray Inc.   June 20, 2011   271
2011 HPCMP User Group © Cray Inc.   June 20, 2011   272
Filesystem
 Lustre, GPFS, and Panasas are “parallel filesystems”
 I/O operations are broken down to basic units and distributed to multiple
  endpoints
 Spreading out operations in this way can greatly improve performance at large
  processor counts

Program
 Just as a problem gets partitioned to multiple processors, I/O operations can
  be done in parallel
 MPI-IO is a standard API for doing parallel I/O operations
 By performing I/O operations in parallel, an application can reduce I/O
  bottlenecks and take advantage of parallel filesystems
 HDF5, NetCDF, and ADIOS all provide parallel I/O in a portable file format

                           2011 HPCMP User Group © Cray Inc.   June 20, 2011          273
 To maximize I/O performance, parallel filesystems
    Break I/O operations into chunks, much like inodes on standard filesystems,
     which get distributed among I/O servers
    Provide a means of controlling how much concurrency to use for a given file
    Make the distributed nature of the data invisible to the program/programmer
 File metadata may be distributed (GPFS) or centralized (Lustre)
 In order to take advantage of a parallel filesystem, a user must
    Ensure that multiple processes are sharing I/O duties; one process is incapable
     of saturating the filesystem
    Prevent multiple processes from using the same “chunk” simultaneously
     (more important with writes)
    Choose a concurrency that is “distributed enough” without spreading data too
     thin to be effective (ideally, 1 process shouldn’t need to access several I/O
     servers)


                              2011 HPCMP User Group © Cray Inc.   June 20, 2011   274
 I/O is simply data migration: Memory <-> Disk
  I/O is a very expensive operation.
     Interactions with data in memory and on disk.
     Must get the kernel involved
  How is I/O performed?
     I/O Pattern
         Number of processes and files.
         File access characteristics.
  Where is I/O performed?
     Characteristics of the computational system.
     Characteristics of the file system.

275                          2011 HPCMP User Group © Cray Inc.   June 20, 2011
 There is no “One Size Fits All” solution to the I/O
   problem.
  Many I/O patterns work well for some range of
   parameters.
  Bottlenecks in performance can occur in many
   locations. (Application and/or File system)
  Going to extremes with an I/O pattern will
   typically lead to problems.




276                           2011 HPCMP User Group © Cray Inc.   June 20, 2011
 The best performance comes from situations when the data is accessed
  contiguously in memory and on disk.
    Facilitates large operations and minimizes latency.

   [Diagram: data layout in Memory vs. on Disk]

 Commonly, data access is contiguous in memory but noncontiguous on disk,
  or vice versa, usually because a global data structure is being reconstructed
  via parallel I/O.

   [Diagram: data layout in Memory vs. on Disk]
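
As a hedged illustration of reconstructing a global structure from distributed blocks (this uses MPI-IO, covered later; the sizes and file name are illustrative):

#include <mpi.h>

/* Each rank owns one lnx-by-lny block of a global nx-by-ny array of doubles
   and writes it into the matching region of a single shared file, so the file
   holds the global structure even though each rank's piece lands in
   noncontiguous regions of the file. */
void write_block(MPI_Comm comm, double *block,
                 int nx, int ny,        /* global array size      */
                 int lnx, int lny,      /* local block size       */
                 int x0, int y0)        /* block origin in global */
{
    int gsizes[2] = { nx, ny };
    int lsizes[2] = { lnx, lny };
    int starts[2] = { x0, y0 };
    MPI_Datatype filetype;
    MPI_File fh;

    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(comm, "global.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, block, lnx * lny, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}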




277                                   2011 HPCMP User Group © Cray Inc.
                                                                      June 20, 2011
 Spokesperson
       One process performs I/O.
          Data Aggregation or Duplication
          Limited by single I/O process.
       Pattern does not scale.
          Time increases linearly with amount of data.
          Time increases with number of processes.
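
For illustration, a minimal C sketch of the spokesperson pattern (the file name and per-rank count are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Rank 0 gathers every rank's contribution and writes the whole file itself.
   Simple, but both the gather and the write serialize on rank 0. */
void spokesperson_write(double *local, int n, MPI_Comm comm)
{
    int rank, nranks;
    double *all = NULL;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    if (rank == 0)
        all = malloc((size_t)n * nranks * sizeof(double));

    MPI_Gather(local, n, MPI_DOUBLE, all, n, MPI_DOUBLE, 0, comm);

    if (rank == 0) {
        FILE *fp = fopen("output.dat", "wb");
        fwrite(all, sizeof(double), (size_t)n * nranks, fp);
        fclose(fp);
        free(all);
    }
}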



278                           2011 HPCMP User Group © Cray Inc.
                                                              June 20, 2011
 File per process
       All processes perform I/O to individual files.
          Limited by file system.
       Pattern does not scale at large process counts.
          Number of files creates bottleneck with metadata operations.
          Number of simultaneous disk accesses creates contention for
           file system resources.
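
For illustration, a minimal C sketch of the file-per-process pattern (the file-name template is illustrative):

#include <mpi.h>
#include <stdio.h>

/* Every rank writes its own file.  Simple and fast at modest scale, but the
   flood of files becomes a metadata bottleneck at large process counts. */
void file_per_process_write(double *local, int n, MPI_Comm comm)
{
    int rank;
    char name[64];
    FILE *fp;

    MPI_Comm_rank(comm, &rank);
    snprintf(name, sizeof(name), "output.%06d", rank);

    fp = fopen(name, "wb");
    fwrite(local, sizeof(double), (size_t)n, fp);
    fclose(fp);
}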




279                                2011 HPCMP User Group © Cray Inc.
                                                                   June 20, 2011
 Shared File
       Each process performs I/O to a single file which is shared.
       Performance
          Data layout within the shared file is very important.
          At large process counts contention can build for file system
           resources.



280                           2011 HPCMP User Group © Cray Inc.
                                                              June 20, 2011
 Subset of processes which perform I/O.
       Aggregation of a group of processes’ data.
           Serializes I/O in group.
       I/O process may access independent files.
           Limits the number of files accessed.
       Group of processes perform parallel I/O to a shared file.
           Increases the number of shared files to increase file system usage.
           Decreases number of processes which access a shared file to
            decrease file system contention.
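
A hedged sketch of this aggregation approach, assuming a fixed number of ranks per I/O group (the group size and file naming are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RANKS_PER_GROUP 32   /* illustrative aggregation factor */

/* Split the ranks into groups; each group gathers its data onto one
   "I/O rank", which writes one file per group.  This caps both the number
   of files and the number of processes touching the filesystem. */
void aggregated_write(double *local, int n, MPI_Comm comm)
{
    int rank, color, grank, gsize;
    MPI_Comm group;
    double *all = NULL;

    MPI_Comm_rank(comm, &rank);
    color = rank / RANKS_PER_GROUP;

    MPI_Comm_split(comm, color, rank, &group);
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    if (grank == 0)
        all = malloc((size_t)n * gsize * sizeof(double));

    MPI_Gather(local, n, MPI_DOUBLE, all, n, MPI_DOUBLE, 0, group);

    if (grank == 0) {
        char name[64];
        FILE *fp;
        snprintf(name, sizeof(name), "output.group%04d", color);
        fp = fopen(name, "wb");
        fwrite(all, sizeof(double), (size_t)n * gsize, fp);
        fclose(fp);
        free(all);
    }
    MPI_Comm_free(&group);
}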




281                                    2011 HPCMP User Group © Cray Inc.   June 20, 2011
 128 MB per file and a 32 MB Transfer size


   [Chart: "File Per Process Write Performance": Write (MB/s) vs. number of
    processes/files (0 to 9000), with curves for 1 MB and 32 MB stripe sizes]


282                        2011 HPCMP User Group © Cray Inc.   June 20, 2011
 32 MB per process, 32 MB Transfer size and Stripe size

   [Chart: "Single Shared File Write Performance": Write (MB/s) vs. number of
    processes (0 to 9000), with curves for POSIX, MPIIO, and HDF5]


283                        2011 HPCMP User Group © Cray Inc.   June 20, 2011
 Lustre
       Minimize contention for file system resources.
       A process should not access more than one or two OSTs.
   Performance
       Performance is limited for single process I/O.
       Parallel I/O utilizing a file-per-process or a single shared file is limited at
        large scales.
       Potential solution is to utilize multiple shared files or a subset of
        processes which perform I/O.




284                                   2011 HPCMP User Group © Cray Inc.   June 20, 2011
 Standard Output and Error streams are
   effectively serial I/O.
  All STDIN, STDOUT, and STDERR I/O
   serialize through aprun
  Disable debugging messages when
   running in production mode.
      “Hello, I’m task 32000!”
      “Task 64000, made it through loop.”




285                        2011 HPCMP User Group © Cray Inc.
                                                           June 20, 2011
 Advantages
      Aggregates smaller read/write
       operations into larger operations.
      Examples: OS Kernel Buffer, MPI-IO
       Collective Buffering
    Disadvantages
      Requires additional memory for the
       buffer.
      Can tend to serialize I/O.
   Caution
      Frequent buffer flushes can adversely
       affect performance.




286                               2011 HPCMP User Group © Cray Inc.
                                                                  June 20, 2011
 If an application does extremely small, irregular I/O, explicit buffering may improve
  performance.
 A post-processing application writes a 1 GB file.
       This case study is an extreme example.
 The data comes from a single writer, but in many small write operations.
       Takes 1080 s (~18 minutes) to complete.
 IOBUF was utilized to intercept these writes with 64 MB buffers.
       Takes 4.5 s to complete, a 99.6% reduction in time.

File "ssef_cn_2008052600f000"
                   Calls         Seconds                   Megabytes           Megabytes/sec     Avg Size
   Open                1        0.001119
   Read              217        0.247026                   0.105957                   0.428931       512
   Write         2083634        1.453222                1017.398927                 700.098632       512
   Close               1        0.220755
   Total         2083853        1.922122                1017.504884                 529.365466        512
   Sys Read            6        0.655251                 384.000000                 586.035160   67108864
   Sys Write          17        3.848807                1081.145508                 280.904052   66686072
   Buffers used            4 (256 MB)
   Prefetches              6
   Preflushes             15


287                                     2011 HPCMP User Group © Cray Inc.
                                                                        June 20, 2011
 Writing a big-endian binary file with the compiler
      flag byteswapio
 File “XXXXXX"
                    Calls     Megabytes                    Avg Size
      Open              1
      Write       5918150   23071.28062                            4088
      Close             1
      Total       5918152   23071.28062                            4088


  Writing a little-endian binary file
 File “XXXXXX"
                    Calls     Megabytes                    Avg Size
      Open              1
      Write           350   23071.28062                    69120000
      Close             1
      Total           352   23071.28062                    69120000


288                         2011 HPCMP User Group © Cray Inc.   June 20, 2011
 MPI-IO allows multiple MPI processes to access the same file in a distributed
    manner
   Like other MPI operations, it’s necessary to provide a data type for items being
    written to the file (may be a derived type)
   There are 3 ways to declare the “file position”
      Explicit offset: each operation explicitly declares the necessary file offset
      Individual File Pointers: Each process has its own unique handle to the file
      Shared File Pointers: The MPI library maintains 1 file pointer and determines
        how to handle parallel access (often via serialization)
   For each file position type, there are 2 “coordination” patterns
      Non-collective: Each process acts on its own behalf
      Collective: The processes coordinate, possibly allowing the library to make
        smart decisions on how to access the filesystem
   MPI-IO allows the user to provide “hints” to improve I/O performance. Often I/O
    performance can be improved via hints about the filesystem or problem-specific
    details
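
A hedged C sketch contrasting two of the file-position styles listed above, the explicit offset and the individual file pointer (the file name and layout are illustrative; a real code would use one write call or the other):

#include <mpi.h>

/* Each rank writes 'count' doubles into its own region of a shared file. */
void write_rank_block(MPI_Comm comm, double *buf, int count)
{
    int rank;
    MPI_File fh;
    MPI_Offset disp;

    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "data.out", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    disp = (MPI_Offset)rank * count * sizeof(double);

    /* Explicit offset: the file position is an argument of the call. */
    MPI_File_write_at(fh, disp, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    /* Individual file pointer: seek first, then write at the pointer
       (shown commented out as the alternative to the call above).     */
    /* MPI_File_seek(fh, disp, MPI_SEEK_SET);                          */
    /* MPI_File_write(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);  */

    MPI_File_close(&fh);
}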
                                2011 HPCMP User Group © Cray Inc.   June 20, 2011   289
int mode, ierr;
MPI_File fh;
MPI_Info info;
MPI_Status status;
/* comm (an MPI communicator), commrank (this process's rank), iosize (bytes
   written per rank), and dbuf (the data buffer) are assumed to be defined
   elsewhere. */

/* Open a file across all ranks as read/write.  Hints can be set on 'info'
   between MPI_Info_create and MPI_File_open. */
mode = MPI_MODE_CREATE|MPI_MODE_RDWR;
MPI_Info_create(&info);
MPI_File_open(comm, "output/test.dat", mode, info, &fh);

/* Set the "view" (a byte offset into the file) for each rank. */
MPI_File_set_view(fh, commrank*iosize, MPI_DOUBLE, MPI_DOUBLE, "native",
   info);

/* Collectively write from all ranks. */
MPI_File_write_all(fh, dbuf, iosize/sizeof(double), MPI_DOUBLE, &status);

/* Close the file from all ranks. */
MPI_File_close(&fh);

                             2011 HPCMP User Group © Cray Inc.   June 20, 2011          290
 Several parallel libraries are available to provide a portable, metadata-rich file
   format
 On Cray machines, it’s possible to set MPI-IO hints in your environment to improve
   out-of-the-box performance
 HDF5 (http://www.hdfgroup.org/HDF5/)
     Has long supported parallel file access
     Currently in version 1.8
 NetCDF (http://www.unidata.ucar.edu/software/netcdf/)
     Multiple parallel implementations of NetCDF exist
     Beginning with version 4.0, HDF5 is used under the hood to provide native
      support for parallel file access.
     Currently in version 4.0.
 ADIOS (http://adiosapi.org)
     Fairly young library in development by ORNL, GA Tech, and others
     Has a native file format, but also supports POSIX, NetCDF, HDF5, and other file
      formats
     Version 1.0 was released at SC09.
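
For instance, a hedged sketch of opening an HDF5 file for parallel, MPI-IO-backed access; it assumes an HDF5 build with parallel support, and the file name is illustrative:

#include <mpi.h>
#include <hdf5.h>

/* Route HDF5 I/O through MPI-IO so all ranks can access one file. */
void create_parallel_hdf5(MPI_Comm comm)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);   /* parallel file access */

    hid_t file = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... collective H5Dwrite calls would go here ... */

    H5Fclose(file);
    H5Pclose(fapl);
}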

                                 2011 HPCMP User Group © Cray Inc.   June 20, 2011      291
 Parallel Filesystems
      Minimize contention for file system resources.
      A process should not access more than one or two OSTs.
      Ideally I/O Buffers and Filesystem “Chunk” sizes should
       match evenly to avoid locking
   Performance
      Performance is limited for single process I/O.
      Parallel I/O utilizing a file-per-process or a single shared file
       is limited at large scales.
      Potential solution is to utilize multiple shared files or a
       subset of processes which perform I/O.
      Large buffers will generally perform best


292                             2011 HPCMP User Group © Cray Inc.   June 20, 2011
Load the IOBUF module:
% module load iobuf
Relink the program. Set the IOBUF_PARAMS environment variable as needed.
% setenv IOBUF_PARAMS '*:verbose'
Execute the program.

 IOBUF has a large number of options for tuning behavior from file to file.
  See man iobuf for details.
 May significantly help codes that write a lot to stdout or stderr.




                             2011 HPCMP User Group © Cray Inc.   June 20, 2011   293
 A particular code both reads and writes a 377 GB file. Runs on 6000 cores.
       Total I/O volume (reads and writes) is 850 GB.
       Utilizes parallel HDF5
   Default Stripe settings: count 4, size 1M, index -1.
       1800 s run time (~ 30 minutes)
   Stripe settings: count -1, size 1M, index -1.
       625 s run time (~ 10 minutes)
   Results
       66% decrease in run time.







294                                 2011 HPCMP User Group © Cray Inc.
                                                                    June 20, 2011
 Included in the Cray MPT library.
   Environment variables used to help MPI-IO optimize I/O performance.
       MPICH_MPIIO_CB_ALIGN environment variable (default 2).
       MPICH_MPIIO_HINTS environment variable.
       Can set striping_factor and striping_unit for files created with MPI-IO.
       If writes and/or reads utilize collective calls, collective buffering can be
        utilized (romio_cb_read/write) to approximately stripe align I/O within
        Lustre.
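
For example, a hedged C sketch of setting such hints on a file created through MPI-IO (the hint values and file name are illustrative):

#include <mpi.h>

/* Ask MPI-IO for 16 OSTs with a 1 MiB stripe, and enable collective
   buffering for writes.  Values here are illustrative, not recommendations. */
void open_with_hints(MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");       /* number of OSTs       */
    MPI_Info_set(info, "striping_unit",   "1048576");  /* stripe size in bytes */
    MPI_Info_set(info, "romio_cb_write",  "enable");   /* collective buffering */

    MPI_File_open(comm, "hinted.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
}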




295                                  2011 HPCMP User Group © Cray Inc.   June 20, 2011
MPI-IO API, non-power-of-2 blocks and transfers, in this case blocks and
transfers both of 1M bytes and a strided access pattern. Tested on an
XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220
segments, 96 GB file

   [Chart: achieved write bandwidth in MB/Sec, y-axis 0 to 1800]
                         2011 HPCMP User Group © Cray Inc.   June 20, 2011   296
MPI-IO API, non-power-of-2 blocks and transfers, in this case blocks and
transfers both of 10K bytes and a strided access pattern. Tested on an
XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220
segments, 96 GB file

   [Chart: achieved write bandwidth in MB/Sec, y-axis 0 to 160]
                         2011 HPCMP User Group © Cray Inc.   June 20, 2011   297
On 5107 PEs, and by application design, a subset of the PEs (88) does the
writes. With collective buffering, this is further reduced to 22 aggregators
(cb_nodes) writing to 22 stripes. Tested on an XT5 with 5107 PEs, 8
cores/node

   [Chart: achieved write bandwidth in MB/Sec, y-axis 0 to 4000]
                          2011 HPCMP User Group © Cray Inc.   June 20, 2011   298
Total file size 6.4 GiB. Mesh of 64M bytes, 32M elements, with work divided
amongst all PEs. The original problem showed very poor scaling. For example,
without collective buffering, 8000 PEs take over 5 minutes to dump. Note that
disabling data sieving was necessary. Tested on an XT5, 8 stripes, 8 cb_nodes

   [Chart: dump time in Seconds (log scale, 1 to 1000) vs. number of PEs,
    with curves for w/o CB, CB=0, CB=1, and CB=2]
                            2011 HPCMP User Group © Cray Inc.   June 20, 2011            299
 Do not open a lot of files all at once (Metadata Bottleneck)
 Use a simple ls (without color) instead of ls -l (OST Bottleneck)
 Remember to stripe files
    Small, individual files => Small stripe counts
    Large, shared files => Large stripe counts
 Never set an explicit starting OST for your files (Filesystem Balance)
 Open Files as Read-Only when possible
 Limit the number of files per directory
 Stat files from just one process
 Stripe-align your I/O (Reduces Locks)
 Read small, shared files once and broadcast the data (OST Contention)
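
A hedged sketch of the last point: one rank reads the small shared file and broadcasts it, rather than having every rank read it (the helper name and error handling are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Rank 0 reads a small shared input file, then broadcasts its contents,
   so thousands of ranks do not all hit the same OST with reads. */
char *read_and_broadcast(const char *path, long *len, MPI_Comm comm)
{
    int rank;
    char *buf = NULL;

    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        FILE *fp = fopen(path, "rb");
        if (fp == NULL)
            MPI_Abort(comm, 1);
        fseek(fp, 0, SEEK_END);
        *len = ftell(fp);
        rewind(fp);
        buf = malloc((size_t)*len);
        fread(buf, 1, (size_t)*len, fp);
        fclose(fp);
    }
    MPI_Bcast(len, 1, MPI_LONG, 0, comm);
    if (rank != 0)
        buf = malloc((size_t)*len);
    MPI_Bcast(buf, (int)*len, MPI_CHAR, 0, comm);
    return buf;
}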




                             2011 HPCMP User Group © Cray Inc.   June 20, 2011   300
 Adaptable IO System (ADIOS)
    http://www.olcf.ornl.gov/center-projects/adios/
 “Optimizing MPI-IO for Applications on Cray XT System” (CrayDoc S-0013-
  10)
 “A Pragmatic Approach to Improving the Large-scale Parallel I/O
  Performance of Scientific Applications.” Crosby, et al. (CUG 2011)




                            2011 HPCMP User Group © Cray Inc.   June 20, 2011   301
2011 HPCMP User Group © Cray Inc.   June 20, 2011   302
2011 HPCMP User Group © Cray Inc.   June 20, 2011   303

HPCMPUG2011 cray tutorial

  • 2.
     Review ofXT6 Architecture  AMD Opteron  Cray Networks  Lustre Basics  Programming Environment  PGI Compiler Basics  The Cray Compiler Environment  Cray Scientific Libraries  Cray Message Passing Toolkit  Cray Performance Analysis Tools  ATP  CCM  Optimizations  CPU  Communication  I/O 2011 HPCMP User Group © Cray Inc. June 20, 2011 2
  • 3.
    AMD CPU Architecture Cray Architecture Lustre Filesystem Basics 2011 HPCMP User Group © Cray Inc. June 20, 2011 3
  • 4.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 4
  • 5.
    2003 2005 2007 2008 2009 2010 AMD AMD “Barcelona” “Shanghai” “Istanbul” “Magny-Cours” Opteron™ Opteron™ Mfg. 130nm SOI 90nm SOI 65nm SOI 45nm SOI 45nm SOI 45nm SOI Process K8 K8 Greyhound Greyhound+ Greyhound+ Greyhound+ CPU Core L2/L3 1MB/0 1MB/0 512kB/2MB 512kB/6MB 512kB/6MB 512kB/12MB Hyper Transport™ 3x 1.6GT/.s 3x 1.6GT/.s 3x 2GT/s 3x 4.0GT/s 3x 4.8GT/s 4x 6.4GT/s Technology Memory 2x DDR1 300 2x DDR1 400 2x DDR2 667 2x DDR2 800 2x DDR2 800 4x DDR3 1333 2011 HPCMP User Group © Cray Inc. June 20, 2011 5
  • 6.
    12 cores 1.7-2.2Ghz 1 4 7 10 105.6Gflops 8 cores 5 11 1.8-2.4Ghz 2 8 76.8Gflops Power (ACP) 3 6 9 12 80Watts Stream 27.5GB/s Cache 12x 64KB L1 12x 512KB L2 12MB L3 2011 HPCMP User Group © Cray Inc. June 20, 2011 6
  • 7.
    L3 cache HT Link HT Link HT Link HT Link L2 cache L2 cache L2 cache L2 cache MEMORY CONTROLLER Core 2 MEMORY CONTROLLER Core 5 Core 8 Core 11 HT Link HT Link HT Link HT Link L2 cache L2 cache L2 cache L2 cache Core 1 Core 4 Core 7 Core 10 L2 cache L2 cache L2 cache L2 cache Core 0 Core 3 Core 6 Core 9 2011 HPCMP User Group © Cray Inc. June 20, 2011 7
  • 8.
     A cacheline is 64B  Unique L1 and L2 cache attached to each core  L1 cache is 64 kbytes  L2 cache is 512 kbytes  L3 Cache is shared between 6 cores  Cache is a “victim cache”  All loads go to L1 immediately and get evicted down the caches  Hardware prefetcher detects forward and backward strides through memory  Each core can perform a 128b add and 128b multiply per clock cycle  This requires SSE, packed instructions  “Stride-one vectorization”  6 cores share a “flat” memory  Non-uniform-memory-access (NUMA) beyond a node 2011 HPCMP User Group © Cray Inc. June 20, 2011 8
  • 9.
    Processor Frequency Peak Bandwidth Balance (Gflops) (GB/sec) (bytes/flop ) Istanbul 2.6 62.4 12.8 0.21 (XT5) 2.0 64 42.6 0.67 MC-8 2.3 73.6 42.6 0.58 2.4 76.8 42.6 0.55 1.9 91.2 42.6 0.47 MC-12 2.1 100.8 42.6 0.42 2.2 105.6 42.6 0.40 2011 HPCMP User Group © Cray Inc. June 20, 2011 9
  • 10.
    Gemini (XE-series) 2011 HPCMP User Group © Cray Inc. June 20, 2011 10
  • 11.
     Microkernel onCompute PEs, full featured Linux on Service PEs.  Service PEs specialize by function Compute PE  Software Architecture Login PE eliminates OS “Jitter” Network PE  Software Architecture enables reproducible run times System PE  Large machines boot in under I/O PE 30 minutes, including filesystem Service Partition Specialized Linux nodes 2011 HPCMP User Group © Cray Inc. June 20, 2011 11
  • 12.
    XE6 System External Login Server Boot RAID 10 GbE IB QDR 2011 HPCMP User Group © Cray Inc. June 20, 2011 13
  • 13.
    6.4 GB/sec directconnect Characteristics HyperTransport Number of 16 or 24 (MC) Cores 32 (IL) Peak 153 Gflops/sec Performance MC-8 (2.4) Peak 211 Gflops/sec Performance MC-12 (2.2) Memory Size 32 or 64 GB per node Memory 83.5 GB/sec Bandwidth 83.5 GB/sec direct connect memory Cray SeaStar2+ Interconnect 2011 HPCMP User Group © Cray Inc. June 20, 2011 14
  • 14.
    Greyhound Greyhound Greyhound Greyhound DDR3 Channel DDR3 Channel 6MB L3 HT3 6MB L3 Greyhound Greyhound Cache Greyhound Cache Greyhound Greyhound Greyhound DDR3 Channel Greyhound Greyhound DDR3 Channel HT3 HT3 Greyhound H Greyhound DDR3 Channel 6MB L3 Greyhound Greyhound T3 6MB L3 Greyhound Greyhound DDR3 Channel Cache Greyhound Cache Greyhound Greyhound Greyhound Greyhound HT3 Greyhound DDR3 Channel DDR3 Channel To Interconnect HT1 / HT3  2 Multi-Chip Modules, 4 Opteron Dies  8 Channels of DDR3 Bandwidth to 8 DIMMs  24 (or 16) Computational Cores, 24 MB of L3 cache  Dies are fully connected with HT3  Snoop Filter Feature Allows 4 Die SMP to scale well 2011 HPCMP User Group © Cray Inc. June 20, 2011 15
  • 15.
    Without snoop filter,a streams test shows 25MB/sec out of a possible 51.2 GB/sec or 48% of peak bandwidth 2011 HPCMP User Group © Cray Inc. June 20, 2011 16
  • 16.
    With snoop filter,a streams test shows 42.3 MB/sec out of a possible 51.2 GB/sec or 82% of peak bandwidth This feature will be key for two- socket Magny Cours Nodes which are the same architecture-wise 2011 HPCMP User Group © Cray Inc. June 20, 2011 17
  • 17.
     New computeblade with 8 AMD Magny Cours processors  Plug-compatible with XT5 cabinets and backplanes  Upgradeable to AMD’s “Interlagos” series  XE6 systems ship with the current SIO blade 2011 HPCMP User Group © Cray Inc. June 20, 2011 18
  • 18.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 19
  • 19.
     Supports 2Nodes per ASIC  168 GB/sec routing capacity  Scales to over 100,000 network endpoints  Link Level Reliability and Adaptive Hyper Hyper Routing Transport Transport 3 3  Advanced Resiliency Features  Provides global address NIC 0 Netlink NIC 1 SB space Block Gemini LO  Advanced NIC designed to Processor efficiently support 48-Port  MPI YARC Router  One-sided MPI  Shmem  UPC, Coarray FORTRAN 2011 HPCMP User Group © Cray Inc. June 20, 2011 20
  • 20.
    Cray Baker Node Characteristics Number of 16 or 24 10 12X Gemini Cores Channels Peak 140 or 210 Gflops/s (Each Gemini High Radix YARC Router Performance acts like two nodes on the 3D with adaptive Memory Size 32 or 64 GB per Torus) Routing node 168 GB/sec capacity Memory 85 GB/sec Bandwidth 2011 HPCMP User Group © Cray Inc. June 20, 2011 21
  • 21.
    Module with SeaStar Z Y X Module with Gemini 2011 HPCMP User Group © Cray Inc. June 20, 2011 22
  • 22.
    net rsp net req LB ht treq p net LB Ring ht treq np FMA req T net ht trsp net net A req S req req vc0 ht p req net R S req O ht np req B I BTE R net D rsp B vc1 Router Tiles HT3 Cave NL ht irsp NPT vc1 ht np net ireq rsp net req CQ NAT ht np req H ht p req net rsp headers ht p A AMO net ht p req ireq R net req req net req vc0 B RMT ht p req RAT net rsp LM CLM  FMA (Fast Memory Access)  Mechanism for most MPI transfers  Supports tens of millions of MPI requests per second  BTE (Block Transfer Engine)  Supports asynchronous block transfers between local and remote memory, in either direction  For use for large MPI transfers that happen in the background 2011 HPCMP User Group © Cray Inc. June 20, 2011 23
  • 23.
     Two GeminiASICs are packaged on a pin-compatible mezzanine card  Topology is a 3-D torus  Each lane of the torus is composed of 4 Gemini router “tiles”  Systems with SeaStar interconnects can be upgraded by swapping this card  100% of the 48 router tiles on each Gemini chip are used 2011 HPCMP User Group © Cray Inc. June 20, 2011 24
  • 24.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 28
  • 25.
    Name Architecture Processor Network # Cores Memory/Core Jade XT-4 AMD Seastar 2.1 8584 2GB DDR2-800 Budapest (2.1 Ghz) Einstein XT-5 AMD Seastar 2.1 12827 2GB (some Shanghai (2.4 nodes have Ghz) 4GB/core) DDR2-800 MRAP XT-5 AMD Seastar 2.1 10400 4GB DDR2-800 Barcelona (2.3 Ghz) Garnet XE-6 Magny Cours Gemini 1.0 20160 2GB DDR3-1333 8 core 2.4 Ghz Raptor XE-6 Magny Cours Gemini 1.0 43712 2GB DDR3-1333 8 core 2.4 Ghz Chugach XE-6 Magny Cours Gemini 1.0 11648 2GB DDR3 -1333 8 core 2.3 Ghz 2011 HPCMP User Group © Cray Inc. June 20, 2011 29
  • 26.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 30
  • 27.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 31
  • 28.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 32
  • 29.
    Low Velocity Airflow High Velocity Airflow Low Velocity Airflow High Velocity Airflow 2011 HPCMP User Group © Cray Inc. June 20, 2011 33 Low Velocity Airflow
  • 30.
    Cool air isreleased into the computer room Liquid Liquid/Vapor in Mixture out Hot air stream passes through evaporator, rejects heat to R134a via liquid-vapor phase change (evaporation). R134a absorbs energy only in the presence of heated air. Phase change is 10x more efficient than pure water cooling. 2011 HPCMP User Group © Cray Inc. June 20, 2011 34
  • 31.
    R134a piping Exit Evaporators Inlet Evaporator 2011 HPCMP User Group © Cray Inc. June 20, 2011 35
  • 32.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 36
  • 33.
    Term Meaning Purpose MDS Metadata Server Manages all file metadata for filesystem. 1 per FS OST Object Storage Target The basic “chunk” of data written to disk. Max 160 per file. OSS Object Storage Server Communicates with disks, manages 1 or more OSTs. 1 or more per FS Stripe Size Size of chunks. Controls the size of file chunks stored to OSTs. Can’t be changed once file is written. Stripe Count Number of OSTs used per Controls parallelism of file. Can’t file. be changed once file is writte. 2011 HPCMP User Group © Cray Inc. June 20, 2011 37
  • 34.
    2011 HPCMP UserGroup © Cray Inc. une 20, 2011 J 38
  • 35.
    2011 HPCMP UserGroup © Cray Inc. une 20, 2011 J 39
  • 36.
     32 MBper OST (32 MB – 5 GB) and 32 MB Transfer Size  Unable to take advantage of file system parallelism  Access to multiple disks adds overhead which hurts performance Single Writer Write Performance 120 100 80 Write (MB/s) 1 MB Stripe 60 32 MB Stripe 40 Lustre 20 0 1 2 4 16 32 64 128 160 Stripe Count 40 2011 HPCMP User Group © Cray Inc. une 20, 2011 J
  • 37.
     Single OST,256 MB File Size  Performance can be limited by the process (transfer size) or file system (stripe size) Single Writer Transfer vs. Stripe Size 140 120 100 Write (MB/s) 80 32 MB Transfer 60 8 MB Transfer 1 MB Transfer 40 Lustre 20 0 1 2 4 8 16 32 64 128 Stripe Size (MB) 41 2011 HPCMP User Group © Cray Inc. une 20, 2011 J
  • 38.
     Use thelfs command, libLUT, or MPIIO hints to adjust your stripe count and possibly size  lfs setstripe -c -1 -s 4M <file or directory> (160 OSTs, 4MB stripe)  lfs setstripe -c 1 -s 16M <file or directory> (1 OST, 16M stripe)  export MPICH_MPIIO_HINTS=‘*: striping_factor=160’  Files inherit striping information from the parent directory, this cannot be changed once the file is written  Set the striping before copying in files 2011 HPCMP User Group © Cray Inc. June 20, 2011 42
  • 39.
    Available Compilers Cray Scientific Libraries Cray Message Passing Toolkit 2011 HPCMP User Group © Cray Inc. June 20, 2011 43
  • 40.
     Cray XT/XESupercomputers come with compiler wrappers to simplify building parallel applications (similar the mpicc/mpif90)  Fortran Compiler: ftn  C Compiler: cc  C++ Compiler: CC  Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries  Cray MPT (MPI, Shmem, etc.)  Cray LibSci (BLAS, LAPACK, etc.)  …  Choosing the underlying compiler is via the PrgEnv-* modules, do not call the PGI, Cray, etc. compilers directly.  Always load the appropriate xtpe-<arch> module for your machine  Enables proper compiler target  Links optimized math libraries 2011 HPCMP User Group © Cray Inc. June 20, 2011 44
  • 41.
    …from Cray’s Perspective PGI – Very good Fortran and C, pretty good C++  Good vectorization  Good functional correctness with optimization enabled  Good manual and automatic prefetch capabilities  Very interested in the Linux HPC market, although that is not their only focus  Excellent working relationship with Cray, good bug responsiveness  Pathscale – Good Fortran, C, possibly good C++  Outstanding scalar optimization for loops that do not vectorize  Fortran front end uses an older version of the CCE Fortran front end  OpenMP uses a non-pthreads approach  Scalar benefits will not get as much mileage with longer vectors  Intel – Good Fortran, excellent C and C++ (if you ignore vectorization)  Automatic vectorization capabilities are modest, compared to PGI and CCE  Use of inline assembly is encouraged  Focus is more on best speed for scalar, non-scaling apps  Tuned for Intel architectures, but actually works well for some applications on AMD 2011 HPCMP User Group © Cray Inc. June 20, 2011 45
  • 42.
    …from Cray’s Perspective GNU so-so Fortran, outstanding C and C++ (if you ignore vectorization)  Obviously, the best for gcc compatability  Scalar optimizer was recently rewritten and is very good  Vectorization capabilities focus mostly on inline assembly  Note the last three releases have been incompatible with each other (4.3, 4.4, and 4.5) and required recompilation of Fortran modules  CCE – Outstanding Fortran, very good C, and okay C++  Very good vectorization  Very good Fortran language support; only real choice for Coarrays  C support is quite good, with UPC support  Very good scalar optimization and automatic parallelization  Clean implementation of OpenMP 3.0, with tasks  Sole delivery focus is on Linux-based Cray hardware systems  Best bug turnaround time (if it isn’t, let us know!)  Cleanest integration with other Cray tools (performance tools, debuggers, upcoming productivity tools)  No inline assembly support 2011 HPCMP User Group © Cray Inc. June 20, 2011 46
  • 43.
     PGI  -fast –Mipa=fast(,safe)  If you can be flexible with precision, also try -Mfprelaxed  Compiler feedback: -Minfo=all -Mneginfo  man pgf90; man pgcc; man pgCC; or pgf90 -help  Cray  <none, turned on by default>  Compiler feedback: -rm (Fortran) -hlist=m (C)  If you know you don’t want OpenMP: -xomp or -Othread0  man crayftn; man craycc ; man crayCC  Pathscale  -Ofast Note: this is a little looser with precision than other compilers  Compiler feedback: -LNO:simd_verbose=ON  man eko (“Every Known Optimization”)  GNU  -O2 / -O3  Compiler feedback: good luck  man gfortran; man gcc; man g++  Intel  -fast  Compiler feedback:  man ifort; man icc; man iCC 2011 HPCMP User Group © Cray Inc. June 20, 2011 47
  • 44.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 48
  • 45.
     Traditional (scalar)optimizations are controlled via -O# compiler flags  Default: -O2  More aggressive optimizations (including vectorization) are enabled with the -fast or -fastsse metaflags  These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre –Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz –Mpre  Interprocedural analysis allows the compiler to perform whole-program optimizations. This is enabled with –Mipa=fast  See man pgf90, man pgcc, or man pgCC for more information about compiler options. 2011 HPCMP User Group © Cray Inc. June 20, 2011 49
  • 46.
     Compiler feedbackis enabled with -Minfo and -Mneginfo  This can provide valuable information about what optimizations were or were not done and why.  To debug an optimized code, the -gopt flag will insert debugging information without disabling optimizations  It’s possible to disable optimizations included with -fast if you believe one is causing problems  For example: -fast -Mnolre enables -fast and then disables loop redundant optimizations  To get more information about any compiler flag, add -help with the flag in question  pgf90 -help -fast will give more information about the -fast flag  OpenMP is enabled with the -mp flag 2011 HPCMP User Group © Cray Inc. June 20, 2011 50
  • 47.
    Some compiler optionsmay effect both performance and accuracy. Lower accuracy is often higher performance, but it’s also able to enforce accuracy.  -Kieee: All FP math strictly conforms to IEEE 754 (off by default)  -Ktrap: Turns on processor trapping of FP exceptions  -Mdaz: Treat all denormalized numbers as zero  -Mflushz: Set SSE to flush-to-zero (on with -fast)  -Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations  Some other compilers turn this on by default, PGI chooses to favor accuracy to speed by default. 2011 HPCMP User Group © Cray Inc. June 20, 2011 51
  • 48.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 52
  • 49.
     Cray hasa long tradition of high performance compilers on Cray platforms (Traditional vector, T3E, X1, X2)  Vectorization  Parallelization  Code transformation  More…  Investigated leveraging an open source compiler called LLVM  First release December 2008 2011 HPCMP User Group © Cray Inc. June 20, 2011 53
  • 50.
    Fortran Source C and C++ Source C and C++ Front End supplied by Edison Design Group, with Cray-developed Fortran Front End C & C++ Front End code for extensions and interface support Interprocedural Analysis Cray Inc. Compiler Technology Compiler Optimization and Parallelization X86 Code Cray X2 Code Generator Generator X86 Code Generation from Open Source LLVM, with Object File additional Cray-developed optimizations and interface support 2011 HPCMP User Group © Cray Inc. June 20, 2011 54
  • 51.
     Standard conforminglanguages and programming models  Fortran 2003  UPC & CoArray Fortran  Fully optimized and integrated into the compiler  No preprocessor involved  Target the network appropriately:  GASNet with Portals  DMAPP with Gemini & Aries  Ability and motivation to provide high-quality support for custom Cray network hardware  Cray technology focused on scientific applications  Takes advantage of Cray’s extensive knowledge of automatic vectorization  Takes advantage of Cray’s extensive knowledge of automatic shared memory parallelization  Supplements, rather than replaces, the available compiler choices 2011 HPCMP User Group © Cray Inc. June 20, 2011 55
  • 52.
     Make sureit is available  module avail PrgEnv-cray  To access the Cray compiler  module load PrgEnv-cray  To target the various chip  module load xtpe-[barcelona,shanghi,mc8]  Once you have loaded the module “cc” and “ftn” are the Cray compilers  Recommend just using default options  Use –rm (fortran) and –hlist=m (C) to find out what happened  man crayftn 2011 HPCMP User Group © Cray Inc. June 20, 2011 56
  • 53.
     Excellent Vectorization  Vectorize more loops than other compilers  OpenMP 3.0  Task and Nesting  PGAS: Functional UPC and CAF available today  C++ Support  Automatic Parallelization  Modernized version of Cray X1 streaming capability  Interacts with OMP directives  Cache optimizations  Automatic Blocking  Automatic Management of what stays in cache  Prefetching, Interchange, Fusion, and much more… 2011 HPCMP User Group © Cray Inc. June 20, 2011 57
  • 54.
     Loop BasedOptimizations  Vectorization  OpenMP  Autothreading  Interchange  Pattern Matching  Cache blocking/ non-temporal / prefetching  Fortran 2003 Standard; working on 2008  PGAS (UPC and Co-Array Fortran)  Some performance optimizations available in 7.1  Optimization Feedback: Loopmark  Focus 2011 HPCMP User Group © Cray Inc. June 20, 2011 58
  • 55.
     Cray compilersupports a full and growing set of directives and pragmas !dir$ concurrent !dir$ ivdep !dir$ interchange !dir$ unroll !dir$ loop_info [max_trips] [cache_na] ... Many more !dir$ blockable man directives man loop_info 2011 HPCMP User Group © Cray Inc. June 20, 2011 59
  • 56.
     Compiler cangenerate an filename.lst file.  Contains annotated listing of your source code with letter indicating important optimizations %%% L o o p m a r k L e g e n d %%% Primary Loop Type Modifiers ------- ---- ---- --------- a - vector atomic memory operation A - Pattern matched b - blocked C - Collapsed f - fused D - Deleted i - interchanged E - Cloned m - streamed but not partitioned I - Inlined p - conditional, partial and/or computed M - Multithreaded r - unrolled P - Parallel/Tasked s - shortloop V - Vectorized t - array syntax temp used W - Unwound w - unwound 2011 HPCMP User Group © Cray Inc. June 20, 2011 60
  • 57.
    • ftn –rm… or cc –hlist=m … 29. b-------< do i3=2,n3-1 30. b b-----< do i2=2,n2-1 31. b b Vr--< do i1=1,n1 32. b b Vr u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3) 33. b b Vr > + u(i1,i2,i3-1) + u(i1,i2,i3+1) 34. b b Vr u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1) 35. b b Vr > + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1) 36. b b Vr--> enddo 37. b b Vr--< do i1=2,n1-1 38. b b Vr r(i1,i2,i3) = v(i1,i2,i3) 39. b b Vr > - a(0) * u(i1,i2,i3) 40. b b Vr > - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) ) 41. b b Vr > - a(3) * ( u2(i1-1) + u2(i1+1) ) 42. b b Vr--> enddo 43. b b-----> enddo 44. b-------> enddo 2011 HPCMP User Group © Cray Inc. June 20, 2011 61
  • 58.
    ftn-6289 ftn: VECTORFile = resid.f, Line = 29 A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38. ftn-6049 ftn: SCALAR File = resid.f, Line = 29 A loop starting at line 29 was blocked with block size 4. ftn-6289 ftn: VECTOR File = resid.f, Line = 30 A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38. ftn-6049 ftn: SCALAR File = resid.f, Line = 30 A loop starting at line 30 was blocked with block size 4. ftn-6005 ftn: SCALAR File = resid.f, Line = 31 A loop starting at line 31 was unrolled 4 times. ftn-6204 ftn: VECTOR File = resid.f, Line = 31 A loop starting at line 31 was vectorized. ftn-6005 ftn: SCALAR File = resid.f, Line = 37 A loop starting at line 37 was unrolled 4 times. ftn-6204 ftn: VECTOR File = resid.f, Line = 37 A loop starting at line 37 was vectorized. 2011 HPCMP User Group © Cray Inc. June 20, 2011 62
  • 59.
     -hbyteswapio  Link time option  Applies to all unformatted fortran IO  Assign command  With the PrgEnv-cray module loaded do this: setenv FILENV assign.txt assign -N swap_endian g:su assign -N swap_endian g:du  Can use assign to be more precise 2011 HPCMP User Group © Cray Inc. June 20, 2011 63
  • 60.
     OpenMP isON by default  Optimizations controlled by –Othread#  To shut off use –Othread0 or –xomp or –hnoomp  Autothreading is NOT on by default;  -hautothread to turn on  Modernized version of Cray X1 streaming capability  Interacts with OMP directives If you do not want to use OpenMP and have OMP directives in the code, make sure to make a run with OpenMP shut off at compile time 2011 HPCMP User Group © Cray Inc. June 20, 2011 64
  • 61.
    2011 HPCMP UserGroup © Cray Inc. June 20, 2011 65
  • 62.
     Cray havehistorically played a role in scientific library development  BLAS3 were largely designed for Crays  Standard libraries were tuned for Cray vector processors (later COTS)  Cray have always tuned standard libraries for Cray interconnect  In the 90s, Cray provided many non-standard libraries  Sparse direct, sparse iterative  These days the goal is to remain portable (standard APIs) whilst providing more performance  Advanced features, tuning knobs, environment variables 2011 HPCMP User Group © Cray Inc. June 20, 2011 66
  • 63.
    FFT Dense Sparse BLAS CRAFFT CASK LAPACK FFTW ScaLAPACK PETSc IRT P-CRAFFT Trilinos CASE IRT – Iterative Refinement Toolkit CASK – Cray Adaptive Sparse Kernels CRAFFT – Cray Adaptive FFT CASE – Cray Adaptive Simple Eigensolver 2011 HPCMP User Group © Cray Inc. June 20, 2011 69
  • 64.
     There aremany libsci libraries on the systems  One for each of  Compiler (intel, cray, gnu, pathscale, pgi )  Single thread, multiple thread  Target (istanbul, mc12 )  Best way to use libsci is to ignore all of this  Load the xtpe-module (some sites set this by default)  E.g. module load xtpe-shanghai / xtpe-istanbul / xtpe-mc8  Cray’s drivers will link the library automatically  PETSc, Trilinos, fftw, acml all have their own module  Tip : make sure you have the correct library loaded e.g. –Wl, -ydgemm_ 2011 HPCMP User Group © Cray Inc. June 20, 2011 70
  • 65.
     Perhaps youwant to link another library such as ACML  This can be done. If the library is provided by Cray, then load the module. The link will be performed with the libraries in the correct order.  If the library is not provided by Cray and has no module, add it to the link line.  Items you add to the explicit link will be in the correct place  Note, to get explicit BLAS from ACML but scalapack from libsci  Load acml module. Explicit calls to BLAS in code resolve from ACML  BLAS calls from the scalapack code will be resolved from libsci (no way around this) 2011 HPCMP User Group © Cray Inc. June 20, 2011 71
  • 66.
     Threading capabilitiesin previous libsci versions were poor  Used PTHREADS (more explicit affinity etc)  Required explicit linking to a _mp version of libsci  Was a source of concern for some applications that need hybrid performance and interoperability with openMP  LibSci 10.4.2 February 2010  OpenMP-aware LibSci  Allows calling of BLAS inside or outside parallel region  Single library supported (there is still a single thread lib)  Usage – load the xtpe module for your system (mc12) GOTO_NUM_THREADS outmoded – use OMP_NUM_THREADS 2011 HPCMP User Group © Cray Inc. June 20, 2011 72
  • 67.
     Allows seamlesscalling of the BLAS within or without a parallel region e.g. OMP_NUM_THREADS = 12 call dgemm(…) threaded dgemm is used with 12 threads !$OMP PARALLEL DO do call dgemm(…) single thread dgemm is used end do Some users are requesting a further layer of parallelism here (see later) 2011 HPCMP User Group © Cray Inc. June 20, 2011 73
  • 68.
    120 Libsci DGEMM efficiency 100 80 GFLOPs 1thread 60 3threads 6threads 40 9threads 12threads 20 0 Dimension (square) Inc. 2011 HPCMP User Group © Cray June 20, 2011 74
  • 69.
    140 Libsci-10.5.2 performance on 2 x MC12 2.0 GHz K=64 120 (Cray XE6) K=128 100 K=200 K=228 80 GFLOPS K=256 60 K=300 K=400 40 K=500 20 K=600 0 K=700 1 2 4 8 12 16 20 24 K=800 Number of threads 2011 HPCMP User Group © Cray Inc. June 20, 2011 75
  • 70.
     All BLASlibraries are optimized for rank-k update * =  However, a huge % of dgemm usage is not from solvers but explicit calls  E.g. DCA++ matrices are of this form * =  How can we very easily provide an optimization for these types of matrices? 2011 HPCMP User Group © Cray Inc. June 20, 2011 76
  • 71.
     Cray BLASexisted on every Cray machine between Cray-2 and Cray X2  Cray XT line did not include Cray BLAS  Cray’s expertise was in vector processors  GotoBLAS was the best performing x86 BLAS  LibGoto is now discontinued  In Q3 2011 LibSci will be released with Cray BLAS 2011 HPCMP User Group © Cray Inc. June 20, 2011 77
  • 72.
    1. Customers require more OpenMP features unobtainable with current library 2. Customers require more adaptive performance for unusual problems .e.g. DCA++ 3. Interlagos / Bulldozer is a dramatic shift in ISA/architecture/performance 4. Our auto-tuning framework has advanced to the point that we can tackle this problem (good BLAS is easy, excellent BLAS is very hard) 5. Need for Bit-reproducable BLAS at high-performance 2011 HPCMP User Group © Cray Inc. June 20, 2011 78
  • 73.
    "anything that canbe represented in C, Fortran or ASM code can be generated automatically by one instance of an abstract operator in high-level code“ In other words, if we can create a purely general model of matrix-multiplication, and create every instance of it, then at least one of the generated schemes will perform well 2011 HPCMP User Group © Cray Inc. June 20, 2011 79
  • 74.
     Start witha completely general formulation of the BLAS  Use a DSL that expresses every important optimization  Auto-generate every combination of orderings, buffering, and optimization  For every combination of the above, sweep all possible sizes  For a given input set ( M, N, K, datatype, alpha, beta ) map the best dgemm routine to the input  The current library should be a specific instance of the above  Worst-case performance can be no worse than current library  The lowest level of blocking is a hand-written assembly kernel 2011 HPCMP User Group © Cray Inc. June 20, 2011 80
  • 75.
    7.5 7.45 7.4 7.35 7.3 bframe GFLOPS 7.25 libsci 7.2 7.15 7.1 7.05 143 72 12 17 22 27 62 133 67 37 42 57 105 2 7 47 100 128 138 95 32 52 2011 HPCMP User Group © Cray Inc. June 20, 2011 81
  • 76.
     New optimizationsfor Gemini network in the ScaLAPACK LU and Cholesky routines 1. Change the default broadcast topology to match the Gemini network 2. Give tools to allow the topology to be changed by the user 3. Give guidance on how grid-shape can affect the performance 2011 HPCMP User Group © Cray Inc. June 20, 2011 82
  • 77.
     Parallel Versionof LAPACK GETRF  Panel Factorization  Only single column block is involved  The rest of PEs are waiting  Trailing matrix update  Major part of the computation  Column-wise broadcast (Blocking)  Row-wise broadcast (Asynchronous)  Data is packed before sending using PBLAS  Broadcast uses BLACS library  These broadcasts are the major communication patterns 2011 HPCMP User Group © Cray Inc. June 20, 2011 83
  • 78.
     MPI default  Binomial Tree + node-aware broadcast  All PEs makes implicit barrier to make sure the completion  Not suitable for rank-k update  Bidirectional-Ring broadcast  Root PE makes 2 MPI Send calls to both of the directions  The immediate neighbor finishes first  ScaLAPACK’s default  Better than MPI 2011 HPCMP User Group © Cray Inc. June 20, 2011 84
  • 79.
     Increasing RingBroadcast (our new default)  Root makes a single MPI call to the immediate neighbor  Pipelining  Better than bidirectional ring  The immediate neighbor finishes first  Multi-Ring Broadcast (2, 4, 8 etc)  The immediate neighbor finishes first  The root PE sends to multiple sub-rings  Can be done with tree algorithm  2 rings seems the best for row-wise broadcast of LU 2011 HPCMP User Group © Cray Inc. June 20, 2011 85
  • 80.
     Hypercube  Behaves like MPI default  Too many collisions in the message traffic  Decreasing Ring  The immediate neighbor finishes last  No benefit in LU  Modified Increasing Ring  Best performance in HPL  As good as increasing ring 2011 HPCMP User Group © Cray Inc. June 20, 2011 86
  • 81.
    XDLU performance: 3072cores, size=65536 10000 9000 8000 7000 6000 Gflops 5000 4000 3000 SRING IRING 2000 1000 0 32 64 32 64 32 64 32 64 32 64 48 48 24 24 12 12 32 32 16 16 64 64 128 128 256 256 96 96 192 192 NB / P / Q 2011 HPCMP User Group © Cray Inc. June 20, 2011 87
  • 82.
    XDLU performance: 6144cores, size=65536 14000 12000 10000 8000 Gflops 6000 SRING 4000 IRING 2000 0 32 64 32 64 32 64 32 64 32 64 48 48 24 24 12 12 64 64 32 32 128 128 256 256 512 512 96 96 192 192 NB / P / Q 2011 HPCMP User Group © Cray Inc. June 20, 2011 88
  • 83.
     Row MajorProcess Grid puts adjacent PEs in the same row  Adjacent PEs are most probably located in the same node  In flat MPI, 16 or 24 PEs are in the same node  In hybrid mode, several are in the same node  Most MPI sends in I-ring happen in the same node  MPI has good shared-memory device  Good pipelining Node 0 Node 1 Node 2 2011 HPCMP User Group © Cray Inc. June 20, 2011 89
  • 84.
     For PxGETRF:  The variables let users to choose  SCALAPACK_LU_CBCAST broadcast algorithm :  SCALAPACK_LU_RBCAST  IRING increasing ring  For PxPOTRF: (default value)  DRING decreasing ring  SCALAPACK_LLT_CBCAST  SRING split ring (old default  SCALAPCK_LLT_RBCAST value)  SCALAPACK_UTU_CBCAST  MRING multi-ring SCALAPACK_UTU_RBCAST  HYPR hypercube  MPI mpi_bcast  TREE tree  There is also a set function, allowing  FULL full connected the user to change these on the fly 2011 HPCMP User Group © Cray Inc. June 20, 2011 91
  • 85.
     Grid shape/ size  Square grid is most common  Try to use Q = x * P grids, where x = 2, 4, 6, 8  Square grids not often the best  Blocksize  Unlike HPL, fine-tuning not important.  64 usually the best  Ordering  Try using column-major ordering, it can be better  BCAST  The new default will be a huge improvement if you can make your grid the right way. If you cannot, play with the environment variables. 2011 HPCMP User Group © Cray Inc. June 20, 2011 92
  • 86.
  • 87.
Full MPI-2 support (except process spawning), based on ANL MPICH2
   Cray uses the MPICH2 Nemesis layer for Gemini
   Cray-tuned collectives
   Cray-tuned ROMIO for MPI-IO
Current release: 5.3.0 (MPICH2 1.3.1)
   Improved MPI_Allreduce and MPI_Alltoallv
   Initial support for checkpoint/restart for MPI or Cray SHMEM on XE systems
   Improved support for MPI thread safety
   module load xt-mpich2
Tuned SHMEM library
   module load xt-shmem
  • 88.
[Chart] MPI_Alltoall with 10,000 processes, comparing the original and optimized algorithms on Cray XE6 systems: time in microseconds versus message size (256 to 32768 bytes).
  • 89.
[Chart] 8-byte MPI_Allgather and MPI_Allgatherv scaling, comparing the original and optimized algorithms on Cray XE6 systems (both algorithms were optimized for Cray XE6): time in microseconds versus number of processes (1024p to 32768p).
  • 90.
Eager-protocol message size limit
   Default is 8192 bytes
   Maximum size message that can go through the eager protocol
   May help for apps that send medium-size messages and do better when loosely coupled.
      Does the application spend a large amount of time in MPI_Waitall? Setting this
      environment variable higher may help.
   Maximum value is 131072 bytes
   Remember that for this path it helps to pre-post receives if possible (see the sketch below)
   Note that a 40-byte CH3 header is included when accounting for the message size
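For readers unsure what "pre-post receives" means in practice, here is a minimal, hypothetical sketch (buffer layout and neighbor ranks are invented, not from the slides): the receives are posted before the matching sends arrive, so eager messages can land directly in the user buffers instead of an unexpected-message buffer.

    /* Sketch: pre-posting receives ahead of the matching sends. */
    #include <mpi.h>

    void exchange(double *recvbuf, double *sendbuf, int n,
                  int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];

        /* post the receives first ... */
        MPI_Irecv(recvbuf,     n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(recvbuf + n, n, MPI_DOUBLE, right, 0, comm, &req[1]);

        /* ... then send; small messages travel via the eager path */
        MPI_Send(sendbuf,     n, MPI_DOUBLE, right, 0, comm);
        MPI_Send(sendbuf + n, n, MPI_DOUBLE, left,  0, comm);

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }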
  • 91.
Eager-protocol DMA buffers
   Default is 64 32-KB buffers (2 MB total)
   Controls the number of 32-KB DMA buffers available to each rank for the eager protocol described earlier
   May help to increase modestly, but other resources constrain the usability of a large number of buffers
  • 92.
  • 93.
What do I mean by PGAS?
   Partitioned Global Address Space
   UPC
   Coarray Fortran (Fortran 2008)
   SHMEM (counted as PGAS here for convenience)
SHMEM is library based
   Not part of any language standard
   Compiler independent
   The compiler has no knowledge that it is compiling a PGAS code and does nothing different, i.e. no transformations or optimizations
  • 94.
UPC
   Specification that extends the ISO/IEC 9899 C standard
   Integrated into the language
   Heavily compiler dependent
   Compiler intimately involved in detecting and executing remote references
   Flexible, but filled with challenges: pointers, a lack of true multidimensional arrays, and many options for distributing data
Fortran 2008
   Now incorporates coarrays
   Compiler dependent
   Philosophically different from UPC
   Replication of arrays on every image, with "easy and obvious" ways to access those remote locations
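For readers new to UPC syntax, here is a tiny, self-contained example (not from the slides; the array name and sizes are invented) showing a blocked shared array and an implicit remote read, which is the kind of plain assignment the compiler must turn into a network get:

    /* Minimal UPC illustration: a shared array blocked across threads;
     * each thread reads one element owned by its neighbor. */
    #include <upc.h>
    #include <stdio.h>

    #define N_PER_THREAD 4
    shared [N_PER_THREAD] long data[N_PER_THREAD * THREADS];

    int main(void)
    {
        /* initialize the locally owned block */
        for (int i = 0; i < N_PER_THREAD; i++)
            data[MYTHREAD * N_PER_THREAD + i] = MYTHREAD;

        upc_barrier;

        /* read an element owned by the next thread; the compiler turns
         * this ordinary-looking assignment into a remote get */
        long remote = data[((MYTHREAD + 1) % THREADS) * N_PER_THREAD];
        printf("thread %d saw %ld\n", MYTHREAD, remote);
        return 0;
    }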
  • 95.
  • 96.
Translate the UPC source code into hardware-executable operations that produce the proper behavior, as defined by the specification
   Storing to a remote location?
   Loading from a remote location?
   When does the transfer need to be complete?
   Are there any dependencies between this transfer and anything else?
   No ordering guarantees are provided by the network; the compiler is responsible for making sure everything gets to its destination in the correct order
  • 97.
What the user writes:

    for ( i = 0; i < ELEMS_PER_THREAD; i += 1 ) {
        local_data[i] += global_2d[i][target];
    }

What the compiler conceptually generates:

    for ( i = 0; i < ELEMS_PER_THREAD; i += 1 ) {
        temp = pgas_get(&global_2d[i][target]); // initiate the get
        pgas_fence();                           // make sure the get is complete
        local_data[i] += temp;                  // use the local copy to complete the operation
    }

The compiler must
   Recognize you are referencing a shared location
   Initiate the load of the remote data
   Make sure the transfer has completed
   Proceed with the calculation
   Repeat for all iterations of the loop
  • 98.
    for ( i = 0; i < ELEMS_PER_THREAD; i += 1 ) {
        temp = pgas_get(&global_2d[i][target]); // initiate the get
        pgas_fence();                           // make sure the get is complete
        local_data[i] += temp;                  // use the local copy to complete the operation
    }

Simple translation results in
   Single-word references
   Lots of fences
   Little to no latency hiding
   No use of special hardware
   Nothing here says "fast"
  • 99.
We want the compiler to generate code that runs as fast as possible given what the user has written, or to allow the user to get fast performance with simple modifications.
   Increase message size
      Do multi-/many-word transfers whenever possible, not single-word
   Minimize fences
      Delay the fence "as much as possible"
      Eliminate the fence in some circumstances
   Use the appropriate hardware
      Use on-node hardware for on-node transfers
      Use the transfer mechanism appropriate for the message size
   Overlap communication and computation
   Use hardware atomic functions where appropriate
  • 100.
Primary Loop Types          Modifiers
A - Pattern matched         a - atomic memory operation
C - Collapsed               b - blocked
D - Deleted                 c - conditional and/or computed
E - Cloned                  f - fused
G - Accelerated             g - partitioned
I - Inlined                 i - interchanged
M - Multithreaded           m - partitioned
V - Vectorized              n - non-blocking remote transfer
                            p - partial
                            r - unrolled
                            s - shortloop
                            w - unwound
  • 101.
  • 102.
     15.          shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
     ...
     83.  1            before = upc_ticks_now();
     84.  1  r8------< for ( i = 0, j = target; i < ELEMS_PER_THREAD ;
     85.  1  r8            i += 1, j += THREADS ) {
     86.  1  r8  n         local_data[i] = global_1d[j];
     87.  1  r8------> }
     88.  1            after = upc_ticks_now();

1D get BW = 0.027598 Gbytes/s
  • 103.
     15.          shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
     ...
    101.  1            before = upc_ticks_now();
    102.  1            upc_memget(&local_data[0], &global_1d[target], 8*ELEMS_PER_THREAD);
    103.  1
    104.  1            after = upc_ticks_now();

1D get BW        = 0.027598 Gbytes/s
1D upc_memget BW = 4.972960 Gbytes/s
upc_memget is 184 times faster!
  • 104.
     16.          shared long global_2d[MAX_ELEMS_PER_THREAD][THREADS];
     ...
    121.  1  A-------< for ( i = 0; i < ELEMS_PER_THREAD; i+=1) {
    122.  1  A             local_data[i] = global_2d[i][target];
    123.  1  A-------> }

1D get BW        = 0.027598 Gbytes/s
1D upc_memget BW = 4.972960 Gbytes/s
2D get BW        = 4.905653 Gbytes/s
Pattern matching can give you the same performance as using upc_memget.
  • 105.
  • 106.
PGAS data references made by the single statement immediately following the pgas defer_sync directive will not be synchronized until the next fence instruction.
   Only applies to the next UPC/CAF statement
   Does not apply to UPC "routines"
   Does not apply to SHMEM routines
Normally the compiler synchronizes the references in a statement as late as possible without violating program semantics. The purpose of the defer_sync directive is to synchronize the references even later, beyond where the compiler can determine it is safe.
Extremely powerful!
   Can easily overlap communication and computation with this directive (see the sketch below)
   Applies to both "gets" and "puts"
   Can be used to implement a variety of "tricks". Use your imagination!
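A hedged sketch of the overlap idea in UPC. The directive spelling and placement follow CCE's PGAS support as I understand it, and the names are invented, so treat the details as an assumption and check the compiler documentation:

    /* Sketch: issue a remote get early, do local work, and only force
     * completion at a later fence. */
    #include <upc.h>

    shared double remote_val[THREADS];
    double local_work(void);

    double overlap_example(void)
    {
        double tmp, acc;

    #pragma pgas defer_sync
        tmp = remote_val[(MYTHREAD + 1) % THREADS];  /* get starts, sync deferred */

        acc = local_work();      /* unrelated computation overlaps the transfer */

        upc_fence;               /* the deferred get must be complete here */
        return acc + tmp;
    }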
  • 107.
CrayPAT
  • 108.
Future system basic characteristics:
   Many-core, hybrid multi-core computing
   Increase in on-node concurrency
   10s-100s of cores sharing memory
   With or without a companion accelerator
   Vector hardware at the low level
Impact on applications:
   Restructure / evolve applications while using existing programming models to take advantage of increased concurrency
   Expand use of mixed-mode programming models (MPI + OpenMP + accelerated kernels, etc.)
  • 109.
Focus on automation (simplify tool usage, provide feedback based on analysis)
Enhance support for multiple programming models within a program (MPI, PGAS, OpenMP, SHMEM)
Scaling (larger jobs, more data, better tool response)
New processors and interconnects
Extend the performance tools to include pre-runtime optimization information from the Cray compiler
  • 110.
New predefined wrappers (ADIOS, ARMCI, PETSc, PGAS libraries)
More UPC and Co-array Fortran support
Support for non-record-locking file systems
Support for applications built with shared libraries
Support for Chapel programs
pat_report tables available in Cray Apprentice2
  • 111.
Enhanced PGAS support is available in perftools 5.1.3 and later
Profiles of a PGAS program can be created to show:
   Top time-consuming functions/line numbers in the code
   Load imbalance information
Performance statistics are attributed to user source by default
   Statistics can also be exposed by library, to see underlying operations such as wait time on barriers
Data collection is based on the methods used for the MPI library
   PGAS data is collected by default when using Automatic Profiling Analysis (pat_build -O apa)
   Predefined wrappers for runtime libraries (caf, upc, pgas) enable attribution of samples or time to user source
UPC and SHMEM heap tracking is coming in a subsequent release
   -g heap will track the shared heap in addition to the local heap
  • 112.
Table 1:  Profile by Function

 Samp % |  Samp |  Imb. |   Imb. |Group
        |       |  Samp | Samp % | Function
        |       |       |        |  PE='HIDE'

 100.0% |    48 |    -- |     -- |Total
|------------------------------------------
|  95.8% |    46 |    -- |     -- |USER
||-----------------------------------------
||  83.3% |    40 |  1.00 |   3.3% |all2all
||   6.2% |     3 |  0.50 |  22.2% |do_cksum
||   2.1% |     1 |  1.00 |  66.7% |do_all2all
||   2.1% |     1 |  0.50 |  66.7% |mpp_accum_long
||   2.1% |     1 |  0.50 |  66.7% |mpp_alloc
||=========================================
|   4.2% |     2 |    -- |     -- |ETC
||-----------------------------------------
||   4.2% |     2 |  0.50 |  33.3% |bzero
|==========================================
  • 113.
Table 2:  Profile by Group, Function, and Line

 Samp % |  Samp |  Imb. |   Imb. |Group
        |       |  Samp | Samp % | Function
        |       |       |        |  Source
        |       |       |        |   Line
        |       |       |        |    PE='HIDE'

 100.0% |    48 |    -- |     -- |Total
|--------------------------------------------
|  95.8% |    46 |    -- |     -- |USER
||-------------------------------------------
||  83.3% |    40 |    -- |     -- |all2all
3|        |       |       |        | mpp_bench.c
4|        |       |       |        |  line.298
||   6.2% |     3 |    -- |     -- |do_cksum
3|        |       |       |        | mpp_bench.c
||||-----------------------------------------
4|||  2.1% |     1 |  0.25 |  33.3% |line.315
4|||  4.2% |     2 |  0.25 |  16.7% |line.316
||||=========================================
  • 114.
Table 1:  Profile by Function and Callers, with Line Numbers

 Samp % |  Samp |Group
        |       | Function
        |       |  Caller
        |       |   PE='HIDE'

 100.0% |    47 |Total
|---------------------------
|  93.6% |    44 |ETC
||--------------------------
||  85.1% |    40 |upc_memput
3|        |       | all2all:mpp_bench.c:line.298
4|        |       |  do_all2all:mpp_bench.c:line.348
5|        |       |   main:test_all2all.c:line.70
||   4.3% |     2 |bzero
3|        |       | (N/A):(N/A):line.0
||   2.1% |     1 |upc_all_alloc
3|        |       | mpp_alloc:mpp_bench.c:line.143
4|        |       |  main:test_all2all.c:line.25
||   2.1% |     1 |upc_all_reduceUL
3|        |       | mpp_accum_long:mpp_bench.c:line.185
4|        |       |  do_cksum:mpp_bench.c:line.317
5|        |       |   do_all2all:mpp_bench.c:line.341
6|        |       |    main:test_all2all.c:line.70
||==========================
  • 115.
Table 1:  Profile by Function and Callers, with Line Numbers

 Time % |     Time |   Calls |Group
        |          |         | Function
        |          |         |  Caller
        |          |         |   PE='HIDE'

 100.0% | 0.795844 | 73904.0 |Total
|-----------------------------------------
|  78.9% | 0.628058 | 41121.8 |PGAS
||----------------------------------------
||  76.1% | 0.605945 | 32768.0 |__pgas_put
3|        |          |         | all2all:mpp_bench.c:line.298
4|        |          |         |  do_all2all:mpp_bench.c:line.348
5|        |          |         |   main:test_all2all.c:line.70
||   1.5% | 0.012113 |    10.0 |__pgas_barrier
3|        |          |         | (N/A):(N/A):line.0
...
  • 116.
||========================================
|  15.7% | 0.125006 |     3.0 |USER
||----------------------------------------
||  12.2% | 0.097125 |     1.0 |do_all2all
3|        |          |         | main:test_all2all.c:line.70
||   3.5% | 0.027668 |     1.0 |main
3|        |          |         | (N/A):(N/A):line.0
||========================================
|   5.4% | 0.042777 | 32777.2 |UPC
||----------------------------------------
||   5.3% | 0.042321 | 32768.0 |upc_memput
3|        |          |         | all2all:mpp_bench.c:line.298
4|        |          |         |  do_all2all:mpp_bench.c:line.348
5|        |          |         |   main:test_all2all.c:line.70
|=========================================
  • 117.
[Screenshot: Cray Apprentice2 with the new text-table icon; right click for table generation options.]
  • 118.
  • 119.
Scalability
   New .ap2 data format and client/server model
      Reduced pat_report processing and report-generation times
      Reduced app2 data load times
      Graphical presentation handled locally (not passed through the ssh connection)
      Better tool responsiveness
   Minimizes data loaded into memory at any given time
   Reduced server footprint on the Cray XT/XE service node
   Larger jobs supported
Distributed Cray Apprentice2 (app2) client for Linux
   app2 clients for Mac and Windows laptops coming later this year
  • 120.
CPMD: MPI, instrumented with pat_build -u, HWPC=1, 960 cores

                    Perftools 5.1.3     Perftools 5.2.0
   .xf -> .ap2      88.5 seconds        22.9 seconds
   ap2 -> report    1512.27 seconds     49.6 seconds

VASP: MPI, instrumented with pat_build -gmpi -u, HWPC=3, 768 cores

                    Perftools 5.1.3     Perftools 5.2.0
   .xf -> .ap2      45.2 seconds        15.9 seconds
   ap2 -> report    796.9 seconds       28.0 seconds
  • 121.
From a Linux desktop:
   % module load perftools
   % app2
   % app2 kaibab:
   % app2 kaibab:/lus/scratch/heidi/swim+pat+10302-0t.ap2
   File -> Open Remote...
(A ':' signifies a remote host instead of an ap2 file.)
  • 122.
Optional app2 client for Linux desktops available as of 5.2.0
   Can still run app2 from a Cray service node
   Improves response times, as X11 traffic is no longer passed through the ssh connection
   Replaces the 32-bit Linux desktop version of Cray Apprentice2
   Uses libssh to establish the connection
   app2 clients for Windows and Mac coming in a subsequent release
  • 123.
[Diagram: Linux desktop (X Window System) connected via the X11 protocol to app2 running on a Cray XT login node; the compute nodes produce the collected performance data in my_program.ap2.]

Log into a Cray XT/XE login node
   % ssh -Y seal
Launch Cray Apprentice2 on the Cray XT/XE login node
   % app2 /lus/scratch/mydir/my_program.ap2
The user interface is displayed on the desktop via ssh trusted X11 forwarding
The entire my_program.ap2 file is loaded into memory on the XT login node (can be gigabytes of data)
  • 124.
[Diagram: Linux desktop running the app2 client (X Window System), requesting data from an app2 server on the Cray XT login node; only the requested subset of my_program.ap2 is sent to the client.]

Launch Cray Apprentice2 on the desktop and point it at the data
   % app2 seal:/lus/scratch/mydir/my_program.ap2
The user interface is displayed on the desktop via X Windows-based software
A minimal subset of the data from my_program.ap2 is loaded into memory on the Cray XT/XE service node at any given time
Only the data requested is sent from the server to the client
  • 125.
  • 126.
Major change to the way HW counters are collected, starting with CrayPat 5.2.1 and CLE 4.0 (in conjunction with Interlagos support)
   Linux has officially incorporated support for accessing counters through the perf_events subsystem. Until now, Linux kernels had to be patched to add support for perfmon2, which provided access to the counters for PAPI and for CrayPat.
   Seamless to users, except:
      The overhead incurred when accessing counters has increased
      This creates additional application perturbation
      Cray is working to bring this back in line with perfmon2 overhead
  • 127.
When possible, CrayPat will identify the dominant communication grids (communication patterns) in a program
   Example: nearest-neighbor exchange in 2 or 3 dimensions
   Sweep3D uses a 2-D grid for communication
Determine whether or not a custom MPI rank order will produce a significant performance benefit
   Custom rank orders are helpful for programs with significant point-to-point communication
   This doesn't interfere with MPI collective communication optimizations
  • 128.
Focuses on intra-node communication (place ranks that communicate frequently on the same node, or close by)
   Option to focus on other metrics, such as memory bandwidth
Determine the rank order used during the run that produced the data
Determine the grid that defines the communication
Produce a custom rank order if it is beneficial, based on grid size, grid order, and cost metric
Summarize the findings in the report
   Describe how to re-run with the custom rank order
  • 129.
For Sweep3D with 768 MPI ranks:

   This application uses point-to-point MPI communication between nearest
   neighbors in a 32 X 24 grid pattern. Time spent in this communication
   accounted for over 50% of the execution time. A significant fraction
   (but not more than 60%) of this time could potentially be saved by
   using the rank order in the file MPICH_RANK_ORDER.g, which was
   generated along with this report.

   To re-run with a custom rank order ...
  • 130.
Assist the user with application performance analysis and optimization
   Help the user identify important and meaningful information from potentially massive data sets
   Help the user identify problem areas instead of just reporting data
   Bring optimization knowledge to a wider set of users
Focus on ease of use and intuitive user interfaces
   Automatic program instrumentation
   Automatic analysis
Target scalability issues in all areas of tool development
   Data management: storage, movement, presentation
  • 131.
Supports traditional post-mortem performance analysis
   Automatic identification of performance problems
   Indication of the causes of problems
   Suggestions of modifications for performance improvement
CrayPat
   pat_build: automatic instrumentation (no source code changes needed)
   Run-time library for measurements (transparent to the user)
   pat_report for performance analysis reports
   pat_help: online help utility
Cray Apprentice2
   Graphical performance analysis and visualization tool
  • 132.
CrayPat
   Instrumentation of optimized code
   No source code modification required
   Data collection transparent to the user
   Text-based performance reports
   Derived metrics
   Performance analysis
Cray Apprentice2
   Performance data visualization tool
   Call tree view
   Source code mappings
  • 133.
When performance measurement is triggered
   External agent (asynchronous)
      Sampling
         Timer interrupt
         Hardware counter overflow
   Internal agent (synchronous)
      Code instrumentation
         Event based
         Automatic or manual instrumentation
How performance data is recorded
   Profile ::= summation of events over time
      Run-time summarization (functions, call sites, loops, ...)
   Trace file ::= sequence of events over time
  • 134.
Millions of lines of code
   Automatic profiling analysis
      Identifies the top time-consuming routines
      Automatically creates an instrumentation template customized to your application
Lots of processes/threads
   Load imbalance analysis
      Identifies computational code regions and synchronization calls that could benefit most from load balance optimization
      Estimates the savings if the corresponding section of code were balanced
Long-running applications
   Detection of outliers
  • 135.
Important performance statistics:
   Top time-consuming routines
   Load balance across computing resources
   Communication overhead
   Cache utilization
   FLOPS
   Vectorization (SSE instructions)
   Ratio of computation versus communication
  • 136.
No source code or makefile modification required
   Automatic instrumentation at the group (function) level
      Groups: mpi, io, heap, math SW, ...
Performs link-time instrumentation
   Requires object files
   Instruments optimized code
   Generates a stand-alone instrumented program
   Preserves the original binary
Supports sample-based and event-based instrumentation
  • 137.
Analyzes the performance data and directs the user to meaningful information
Simplifies the procedure to instrument and collect performance data for novice users
Based on a two-phase mechanism:
   1. Automatically detects the most time-consuming functions in the application and feeds this information back to the tool for further (and focused) data collection
   2. Provides performance information on the most significant parts of the application
  • 138.
Performs data conversion
   Combines information from the binary with the raw performance data
Performs analysis on the data
Generates a text report of the performance results
Formats data for input into Cray Apprentice2
  • 139.
CrayPat / Cray Apprentice2 5.0 released September 10, 2009
   New internal data format
   FAQ
   Grid placement support
   Better caller information (ETC group in pat_report)
   Support for larger numbers of processors
   Client/server version of Cray Apprentice2
   Panel help in Cray Apprentice2
  • 140.
Access the performance tools software
   % module load perftools
Build the application, keeping .o files (CCE: -h keepfiles)
   % make clean
   % make
Instrument the application for automatic profiling analysis
   % pat_build -O apa a.out
   You should get an instrumented program a.out+pat
Run the application to get the top time-consuming routines
   % aprun ... a.out+pat     (or qsub <pat script>)
   You should get a performance file ("<sdatafile>.xf") or multiple files in a directory <sdatadir>
  • 141.
Generate a report and a .apa instrumentation file
   % pat_report -o my_sampling_report [<sdatafile>.xf | <sdatadir>]
Inspect the .apa file and the sampling report
Verify whether additional instrumentation is needed
  • 142.
#  You can edit this file, if desired, and use it
#  to reinstrument the program for tracing like this:
#
#      pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
#  These suggested trace options are based on data from:
#
#      /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2,
#      /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf
# ----------------------------------------------------------------------
#  HWPC group to collect by default.

  -Drtenv=PAT_RT_HWPC=1   # Summary with instructions metrics.
# ----------------------------------------------------------------------
#  Libraries to trace.

  -g mpi
# ----------------------------------------------------------------------
#  User-defined functions to trace, sorted by % of samples.
#  Limited to top 200. A function is commented out if it has < 1%
#  of samples, or if a cumulative threshold of 90% has been reached,
#  or if it has size < 200 bytes.
#  Note: -u should NOT be specified as an additional option.

# 43.37%  99659 bytes
  -T mlwxyz_

# 16.09%  17615 bytes
  -T half_

#  6.82%   6846 bytes
  -T artv_

#  1.29%   5352 bytes
  -T currenh_

#  1.03%  25294 bytes
  -T bndbo_

#  Functions below this point account for less than 10% of samples.

#  1.03%  31240 bytes
# -T bndto_

  ...
# ----------------------------------------------------------------------

  -o mhd3d.x+apa                      # New instrumented program.

  /work/crayadm/ldr/mhd3d/mhd3d.x     # Original program.
  • 143.
biolib     Cray Bioinformatics library routines
blacs      Basic Linear Algebra Communication Subprograms
blas       Basic Linear Algebra Subprograms
caf        Co-Array Fortran (Cray X2 systems only)
fftw       Fast Fourier Transform library (64-bit only)
hdf5       manages extremely large and complex data collections
heap       dynamic heap
io         includes stdio and sysio groups
lapack     Linear Algebra Package
lustre     Lustre file system
math       ANSI math
mpi        MPI
netcdf     network common data form (manages array-oriented scientific data)
omp        OpenMP API (not supported on Catamount)
omp-rtl    OpenMP runtime library (not supported on Catamount)
portals    lightweight message passing API
pthreads   POSIX threads (not supported on Catamount)
scalapack  Scalable LAPACK
shmem      SHMEM
stdio      all library functions that accept or return the FILE* construct
sysio      I/O system calls
system     system calls
upc        Unified Parallel C (Cray X2 systems only)
  • 144.
 0  Summary with instruction metrics
 1  Summary with TLB metrics
 2  L1 and L2 metrics
 3  Bandwidth information
 4  HyperTransport information
 5  Floating point mix
 6  Cycles stalled, resources idle
 7  Cycles stalled, resources full
 8  Instructions and branches
 9  Instruction cache
10  Cache hierarchy
11  Floating point operations mix (2)
12  Floating point operations mix (vectorization)
13  Floating point operations mix (SP)
14  Floating point operations mix (DP)
15  L3 (socket-level)
16  L3 (core-level reads)
17  L3 (core-level misses)
18  L3 (core-level fills caused by L2 evictions)
19  Prefetches
  • 145.
Regions, useful to break up long routines
   int PAT_region_begin(int id, const char *label)
   int PAT_region_end(int id)
Disable/enable profiling, useful for excluding initialization
   int PAT_record(int state)
Flush buffer, useful when the program isn't exiting cleanly
   int PAT_flush_buffer(void)
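A short sketch of how these calls might be used in C. The header name pat_api.h and the PAT_STATE_ON/PAT_STATE_OFF constants are assumptions on my part (the slide only lists the function signatures), so check pat_help or intro_craypat(1) on your system:

    /* Sketch: skip initialization, time-region the solver loop, then flush.
     * Header name and PAT_STATE_* constants are assumptions. */
    #include <pat_api.h>

    void solver_step(void);

    void run(int nsteps)
    {
        PAT_record(PAT_STATE_OFF);          /* exclude initialization */
        /* ... expensive setup not worth profiling ... */
        PAT_record(PAT_STATE_ON);

        PAT_region_begin(1, "solver_loop"); /* label appears in pat_report */
        for (int i = 0; i < nsteps; i++)
            solver_step();
        PAT_region_end(1);

        PAT_flush_buffer();                 /* signature as listed above */
    }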
  • 146.
Instrument the application for further analysis (a.out+apa)
   % pat_build -O <apafile>.apa
Run the application
   % aprun ... a.out+apa     (or qsub <apa script>)
Generate a text report and visualization file (.ap2)
   % pat_report -o my_text_report.txt [<datafile>.xf | <datadir>]
View the report in text and/or with Cray Apprentice2
   % app2 <datafile>.ap2
  • 147.
MUST run on Lustre (/work/..., /lus/..., /scratch/..., etc.)
Number of files used to store raw data:
   1 file created for a program with 1 - 256 processes
   √n files created for a program with 257 - n processes
   Ability to customize with PAT_RT_EXPFILE_MAX
  • 148.
Full trace files show transient events but are too large
Current run-time summarization misses transient events
Plan to add the ability to record:
   Top N peak values (N small)
   Approximate standard deviation over time
   For time, memory traffic, etc.
   During both tracing and sampling
  • 149.
Cray Apprentice2 is targeted to help identify and correct:
   Load imbalance
   Excessive communication
   Network contention
   Excessive serialization
   I/O problems
Views include:
   Call graph profile
   Communication statistics
   Time-line view (communication and I/O)
   Activity view
   Pair-wise communication statistics
   Text reports
   Source code mapping
  • 150.
[Screenshot: Cray Apprentice2 overview display; switch the overview display here.]
  • 151.
  • 152.
  • 153.
  • 154.
[Screenshot: min, avg, and max values with -1/+1 standard deviation marks.]
  • 155.
[Screenshot: Cray Apprentice2 call-tree view, annotated:]
   Width = inclusive time; height = exclusive time
   Filtered nodes or sub-trees are marked
   Load balance overview: height = max time, middle bar = average time, lower bar = min time; yellow represents imbalance time
   DUH button: provides hints for performance tuning
   Function list; zoom
  • 156.
[Screenshot: call-tree view menus, annotated:]
   Right mouse click on a node: node menu (e.g., hide/unhide children)
   Right mouse click: view menu (e.g., filter)
   Sort options: % Time, Time, Imbalance %, Imbalance time
   Function list off
  • 157.
  • 158.
  • 159.
  • 160.
  • 161.
[Screenshot: min, avg, and max values with -1/+1 standard deviation marks.]
  • 162.
  • 163.
Cray Apprentice2 panel help
pat_help: interactive help on the Cray performance toolset
FAQ available through pat_help
  • 164.
intro_craypat(1)
   Introduces the CrayPat performance tool
pat_build
   Instrument a program for performance analysis
pat_help
   Interactive online help utility
pat_report
   Generate a performance report, both in text and for use with the GUI
hwpc(3)
   Describes the predefined hardware performance counter groups
papi_counters(5)
   Lists PAPI event counters
   Use the papi_avail or papi_native_avail utilities to get the list of events when running on a specific architecture
  • 165.
pat_report: Help for -O option:

Available option values are in the left column; a prefix can be specified:

  ct                   -O calltree
  defaults             Tables that would appear by default.
  heap                 -O heap_program,heap_hiwater,heap_leaks
  io                   -O read_stats,write_stats
  lb                   -O load_balance
  load_balance         -O lb_program,lb_group,lb_function
  mpi                  -O mpi_callers
  ---
  callers              Profile by Function and Callers
  callers+hwpc         Profile by Function and Callers
  callers+src          Profile by Function and Callers, with Line Numbers
  callers+src+hwpc     Profile by Function and Callers, with Line Numbers
  calltree             Function Calltree View
  calltree+hwpc        Function Calltree View
  calltree+src         Calltree View with Callsite Line Numbers
  calltree+src+hwpc    Calltree View with Callsite Line Numbers
  ...
  • 166.
Interactive by default, or use a trailing '.' to just print a topic
New FAQ as of CrayPat 5.0.0
Has counter and counter-group information
   % pat_help counters amd_fam10h groups .
  • 167.
The top level CrayPat/X help topics are listed below.
A good place to start is:

    overview

If a topic has subtopics, they are displayed under the heading
"Additional topics", as below. To view a subtopic, you need only
enter as many initial letters as required to distinguish it from
other items in the list. To see a table of contents including
subtopics of those subtopics, etc., enter:

    toc

To produce the full text corresponding to the table of contents,
specify "all", but preferably in a non-interactive invocation:

    pat_help all . > all_pat_help
    pat_help report all . > all_report_help

Additional topics:

    API            execute
    balance        experiment
    build          first_example
    counters       overview
    demos          report
    environment    run

pat_help (.=quit ,=back ^=up /=top ~=search) =>
  • 168.
  • 169.
ATP (Abnormal Termination Processing), or: what do you do when task a causes task b to crash?
   Load the ATP module before compiling
   Set ATP_ENABLED before running
Limitations
   ATP disables core dumping; when ATP is running, an application crash does not produce a core dump
   When ATP is running, the application cannot be checkpointed
   ATP does not support threaded application processes
   ATP has been tested at 10,000 cores; behavior at core counts greater than 10,000 is still being researched
  • 170.
Application 926912 is crashing. ATP analysis proceeding...

Stack walkback for Rank 3 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:220
  main@testMPIApp.c:83
  foo@testMPIApp.c:47
  raise@pt-raise.c:42
Stack walkback for Rank 3 done
Process died with signal 4: 'Illegal instruction'
View application merged backtrace tree file 'atpMergedBT.dot' with 'statview'
You may need to 'module load stat'.
  • 171.
  • 172.
  • 173.
What CCM is NOT
   It is NOT a virtual machine or any OS within an OS
   It is NOT an emulator
  • 174.
What is CCM then?
   Provides the runtime environment on compute nodes expected by ISV applications
   Dynamically allocates and configures compute nodes at job start
   Nodes are not permanently dedicated to CCM
      Any compute node can be used
      Allocated like any other batch job (on demand)
   MPI and third-party MPI run over TCP/IP using the high-speed network
   Supports standard services: ssh, rsh, nscd, ldap
   Complete root file system on the compute nodes
   Built on top of the Dynamic Shared Libraries (DSL) environment
   Apps run under CCM: Abaqus, Matlab, Castep, Discoverer, DMol3, Mesodyn, Ensight, and more
Under CCM, everything the application can "see" looks like a standard Linux cluster: Linux OS, x86 processor, and MPI
  • 175.
[Diagram: Cray XT6/XE6 system with compute nodes running in ESM mode, compute nodes running in CCM mode, idle nodes, and service nodes.]

• Many applications running in Extreme Scalability Mode (ESM)
• Submit a CCM application through the batch scheduler; nodes are reserved:
     qsub -l ccm=1 Qname AppScript
• Previous jobs finish; nodes are configured for CCM
• The batch script and application execute
• Other nodes are scheduled for ESM or CCM applications as available
• After the CCM job completes, the CCM nodes are cleared
• The CCM nodes become available for ESM or CCM applications
  • 176.
Support MPIs that are configured to work with the OFED stack
   CCM1 supports ISV applications over TCP/IP only
   CCM2 supports ISV applications over TCP/IP and Gemini on XE6
ISV Application Acceleration (IAA) directly utilizes the HSN through the Gemini user-space APIs
   The goal of IAA/CCM2 is to deliver latency and bandwidth improvements over CCM1 over TCP/IP
   The CCM2 infrastructure is currently in system test
   The IAA design and implementation phase is complete
   CCM2 with IAA is currently in the integration test phase
  • 177.
A code binary compiled for SLES and an Opteron (DSOs are OK)
A third-party MPI library that can use TCP/IP
   We have tried OpenMPI, HP-MPI, and LAM-MPI
   Most of the bigger apps are packaged with their own library (usually HP-MPI)
Add ccmrun to the run script
The IP address of the license server for the application
   Note that right now CCM cannot do an NSLOOKUP
   LMHOSTS must be specified by IP address
With CLE 4.0: an MPI library that uses IBVERBS
  • 178.
CCMRUN: analogous to aprun; runs a third-party batch job
   In most cases, if you already have a run script for your third-party app, adding ccmrun before the application command will set it up
CCMLOGIN: allows interactive access to the head node of an allocated compute pool; takes optional ssh options
   CCM uses ssh known_hosts to set up passwordless ssh between a set of compute nodes. You can go to allocated nodes but no further.
  • 179.
[Diagram: XE6 system layout]
   External login servers
   Internal login (PBS) nodes
   Compute nodes
   Boot RAID, 10 GbE, and IB QDR connections
  • 180.
External login nodes: Dell 4-socket servers through which the user enters the system
PBS nodes: internal single-socket 6-core nodes that run the PBS MOMs
   aprun must be issued from a node on the system database
Compute nodes: 2-socket 8-core Opteron nodes that run a trimmed-down OS (still Linux)
  • 181.
news: diskuse_work diskuse_home system_info.txt

aminga@garnet01:~> uname -a
Linux garnet01 2.6.27.48-0.12-default #1 SMP 2010-09-20 11:03:26 -0400 x86_64 x86_64 x86_64 GNU/Linux

aminga@garnet01:~> qsub -I -lccm=1 -q debug -l walltime=01:00:00 -l ncpus=32 -A ERDCS97290STA
qsub: waiting for job 104868.sdb to start
qsub: job 104868.sdb ready

In CCM JOB: 104868.sdb  JID sdb  USER aminga  GROUP erdcssta
Initializing CCM environment, Please Wait
CCM Start success, 2 of 2 responses

aminga@garnet13:~> uname -a
Linux garnet13 2.6.27.48-0.12.1_1.0301.5737-cray_gem_s #1 SMP Mon Mar 28 22:20:59 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
  • 182.
aminga@garnet13:~> cat $PBS_NODEFILE
nid00972
nid00972
nid00972
<snip>
nid01309
nid01309
nid01309

aminga@garnet13:~> ccmlogin
Last login: Mon Jun 13 13:03:26 2011 from nid01028
aminga@nid00972:~> uname -a
Linux nid00972 2.6.27.48-0.12.1_1.0301.5737-cray_gem_c #1 SMP Mon Mar 28 22:26:26 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

aminga@nid00972:~> ssh nid01309
aminga@nid01309:~> uname -a
Linux nid01309 2.6.27.48-0.12.1_1.0301.5737-cray_gem_c #1 SMP Mon Mar 28 22:26:26 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

aminga@nid00972:~> ssh nid01310
Redirecting to /etc/ssh/ssh_config
ssh: connect to host nid01310 port ...: Connection refused
  • 183.
#!/bin/csh
#PBS -l mppwidth=2
#PBS -l mppnppn=1
#PBS -q ccm_queue
#PBS -j oe
cd $PBS_O_WORKDIR
perl ConstructMachines.LINUX.pl
setenv DSD_MachineLIST $PBS_O_WORKDIR/machines.LINUX
setenv MPI_COMMAND "/usr/local/applic/accelrys/MSModeling5.5/hpmpi/opt/hpmpi/bin/mpirun -np "
ccmrun ./RunDiscover.sh -np 2 nvt_m
  • 184.
#PBS -l mppwidth=2
#PBS -l mppnppn=1
#PBS -j oe
#PBS -N gauss-test-ccm
#PBS -q ccm_queue
cd $PBS_O_WORKDIR
cp $PBS_NODEFILE node_file
./CreatDefaultRoute.pl
mkdir -p scratch
setenv DVS_CACHE off
setenv g09root /usr/local/applic/gaussian/
setenv GAUSS_EXEDIR ${g09root}/g09
setenv GAUSS_EXEDIR ${g09root}/g09/linda-exe:$GAUSS_EXEDIR
setenv GAUSS_SCRDIR `pwd`
setenv TMPDIR `pwd`
source ${g09root}/g09/bsd/g09.login
setenv GAUSS_LFLAGS "-vv -nodefile node_file -opt Tsnet.Node.lindarsharg:ssh"
setenv LINDA_PATH ${g09root}/g09/linda8.2/opteron-linux
set LINDA_LAUNCHVERBOSE=1
ccmrun ${g09root}/g09/g09 < gauss-test-ccm.com
setenv TEND `echo "print time();" | perl`
echo "Gaussian CCM walltime: `expr $TEND - $TBEGIN` seconds"
  • 185.
cd $PBS_O_WORKDIR
/bin/rm -rf bhost.def
cat $PBS_NODEFILE > bhost.def
/bin/rm -rf job.script
cat > job.script << EOD
#!/bin/csh
set echo
cd $PWD
setenv AEROSOFT_HOME /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft
setenv LAMHOME /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft
setenv PATH /work/aminga/captest/isvdata/GASP/GASPSTD/aerosoft/bin:$PATH
setenv TMPDIR /work/aminga
ln -s /usr/lib64/libpng.so libpng.so.3
setenv LD_LIBRARY_PATH `pwd`:$LD_LIBRARY_PATH
setenv LAMRSH "ssh -x"
lamboot bhost.def
time mpirun -np 2 -x LD_LIBRARY_PATH gasp --mpi -i duct.xml --run 2 --elmhost 140.31.9.44
EOD
chmod +x job.script
ccmrun job.script
  • 186.
#!/bin/sh
#PBS -q ccm_queue
#PBS -l mppwidth=48
#PBS -j oe
#PBS -N CFX
cd $PBS_O_WORKDIR
TOP_DIR=/usr/local/applic/ansys
export ANSYSLIC_DIR=$TOP_DIR/shared_files/licensing
export LD_LIBRARY_PATH=$TOP_DIR/v121/CFX/tools/hpmpi-2.3/Linux-amd64/lib/linux_amd64:$LD_LIBRARY_PATH
export PATH=$TOP_DIR/v121/CFX/bin:$PATH
export CFX5RSH=ssh
export MPIRUN_OPTIONS="-TCP -prot -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23"
/bin/rm -rf host.list
cat $PBS_NODEFILE > host.list
export proc_list=`sort host.list | uniq -c | awk '{ printf("%s*%s ", $2, $1) ; }'`
echo $proc_list
which cfx5solve
ccmrun cfx5solve -def S*400k.def -par-dist "$proc_list" -start-method "HP MPI Distributed Parallel"
rm -f host.list
  • 187.
#!/bin/bash
#PBS -l mppwidth=16
#PBS -q ccm_queue
#PBS -j oe
#PBS -N abaqus_e1
cd $PBS_O_WORKDIR
TMPDIR=.
ABAQUS=/usr/local/applic/abaqus
#cp ${ABAQUS}/input/e1.inp e1.inp
cat $PBS_NODEFILE
echo "Run Abaqus"
ccmrun ${ABAQUS}/6.10-1/exec/abq6101.exe input=e1.inp job=e1 cpus=16 interactive
  • 188.
#!/bin/csh
#PBS -q ccm_queue
#PBS -l mppwidth=32
#PBS -j oe
#PBS -N AFRL_Fluent
cd $PBS_O_WORKDIR
setenv FLUENT_HOME /usr/local/applic/fluent/12.1/fluent
setenv FLUENT_ARCH lnamd64
setenv PATH /usr/local/applic/fluent/12.1/v121/fluent/bin:$PATH
setenv FLUENT_INC /usr/local/applic/fluent/12.1/v121/fluent
###setenv LM_LICENSE_FILE 7241@10.128.0.72
setenv LM_LICENSE_FILE 27000@10.128.0.76
setenv ANSYSLMD_LICENSE_FILE /home/applic/ansys/shared_files/licensing/license.dat
echo ${LM_LICENSE_FILE}
setenv FLUENT_VERSION -r12.1.1
cd $PBS_O_WORKDIR
rm -rf host.list
cat $PBS_NODEFILE > host.list
module load ccm dot
setenv MALLOC_MMAP_MAX_ 0
setenv MALLOC_TRIM_THRESHOLD_ 536870912
setenv MPIRUN_OPTIONS " -TCP -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7"
setenv MPIRUN_OPTIONS "${MPIRUN_OPTIONS},8,9,10,11,12,13,14,15 "
setenv MPI_SOCKBUFSIZE 524288
setenv MPI_WORKDIR $PWD
setenv MPI_COMMD 1024,1024
ccmrun /usr/local/applic/fluent/v121/fluent/bin/fluent -r12.1.2 2ddp -mpi=hp -gu -driver null -t4 -i blast.inp > tstfluent-blast.jobout
  • 189.
ALPS allows you to run one aprun instance per node; using CCM you can get around that.
Suppose you want to run 16 single-core jobs and use only one node:

   qsub -lccm=1 -q debug -l walltime=01:00:00 -l ncpus=16 -A ERDCS97290STA

   #PBS -j oe
   cd $PBS_O_WORKDIR
   ./myapp&
   ./myapp&
   ./myapp&
   ./myapp&
   ./myapp&
   ./myapp&
  • 190.
Engineering for Multi-level Parallelism
  • 191.
Flat, all-MPI parallelism is beginning to be too limited as the number of compute cores rapidly increases
It is becoming necessary to design applications with multiple levels of parallelism:
   High-level MPI parallelism between nodes
      You're probably already doing this
   Loose, on-node parallelism via threads at a high level
      Most codes today are using MPI, but threading is becoming more important
   Tight, on-node, vector parallelism at a low level
      SSE/AVX on CPUs
      GPU threaded parallelism
Programmers need to expose the same parallelism for all future architectures (see the sketch below).
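A minimal sketch of how the three levels can appear together in code. This is illustrative only, not from S3D; all names are invented, and the MPI level (one rank per node-sized block, with halo exchange) is only indicated in the comments:

    /* Level 1: each MPI rank owns one block of the global grid
     * (halo exchange with neighbor ranks omitted for brevity). */
    void update(double *u, const double *rhs, int n, double dt)
    {
        /* Level 2: loose, high-level OpenMP threading across the node */
        #pragma omp parallel for schedule(static)
        for (int j = 0; j < n; j++) {
            /* Level 3: a stride-1 inner loop the compiler can turn
             * into SSE/AVX vector instructions */
            for (int i = 0; i < n; i++)
                u[(long)j*n + i] += dt * rhs[(long)j*n + i];
        }
    }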
  • 192.
A benchmark problem was defined to closely resemble the target simulation
   52-species n-heptane chemistry and 48^3 grid points per node
   48^3 * 18,500 nodes = 2 billion grid points
   The target problem would take two months on today's Jaguar
The code was benchmarked and profiled on a dual-hexcore XT5
Several kernels were identified and extracted from core S3D into stand-alone driver programs: mini-apps!
  • 193.
    Goals: Convert S3D to a hybrid multi-core application suited for a multi-core node with or without an accelerator.  Hoisted several loops up the call tree  Introduced high-level OpenMP  Be able to perform the computation entirely on the accelerator if available. - Arrays and data able to reside entirely on the accelerator. - Data sent from accelerator to host CPU for halo communication, I/O and monitoring only. Strategy:  To program using both hand-written and generated code. - Hand-written and tuned CUDA*. - Automated Fortran and CUDA generation for chemistry kernels - Automated code generation through compiler directives  S3D kernels are now a part of Cray’s compiler development test cases * Note: CUDA refers to CUDA-Fortran, unless mentioned otherwise 2011 HPCMP User Group © Cray Inc. June 20, 2011 201
  • 194.
RHS - called 6 times for each time step (Runge-Kutta iterations)
   Calculate primary variables - point-wise mesh loops within 5 different routines
   Perform derivative computation - high-order differencing; halos 5 zones thick
   Calculate diffusion - 3 different routines with some derivative computation
   Perform derivative computation for forming the RHS - lots of communication
   Perform point-wise chemistry computation

All major loops are at a low level of the call tree.
(In the original slide, green marks the major point-wise computation and yellow marks the other major computation.)
  • 195.
RHS - called 6 times for each time step (Runge-Kutta iterations)
   Calculate primary variables - point-wise mesh loops within 3 different routines [OMP loop over grid]
   Perform derivative computation - high-order differencing [overlapped]
   Calculate primary variables - point-wise mesh loops within 2 different routines [OMP loop over grid]
   Calculate diffusion - 3 different routines with some derivative computation
   Perform derivative computation [overlapped]
   Perform point-wise chemistry computation (1) [OMP loop over grid]
   Perform derivative computation for forming the RHS - lots of communication [overlapped]
   Perform point-wise chemistry computation (2) [OMP loop over grid]
  • 196.
  • 197.
Create a good-granularity OpenMP loop
   Improves cache reuse
   Reduces memory usage significantly
   Creates a good potential kernel for an accelerator
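A hedged illustration of the granularity point, with invented routine names (S3D itself is Fortran; this C sketch only shows the shape): one coarse OpenMP loop hoisted above the per-block routines, rather than many small parallel regions inside them.

    void primary_vars(int blk);
    void diffusion(int blk);
    void chemistry(int blk);

    void rhs(int nblocks)
    {
        /* One parallel region, one coarse loop: each thread works on the
         * same block through all phases, which helps cache reuse and
         * yields a large kernel suitable for an accelerator. */
        #pragma omp parallel for schedule(static)
        for (int blk = 0; blk < nblocks; blk++) {
            primary_vars(blk);
            diffusion(blk);
            chemistry(blk);
        }
    }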
  • 198.
CPU Optimizations
Optimizing Communication
I/O Best Practices
  • 199.
  • 200.
    55.  1                 ii = 0
    56.  1 2-----------<   do b = abmin, abmax
    57.  1 2 3---------<     do j = ijmin, ijmax
    58.  1 2 3                 ii = ii+1
    59.  1 2 3                 jj = 0
    60.  1 2 3 4-------<       do a = abmin, abmax
    61.  1 2 3 4 r8----<         do i = ijmin, ijmax
    62.  1 2 3 4 r8                jj = jj+1
    63.  1 2 3 4 r8                f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
    64.  1 2 3 4 r8                f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
    65.  1 2 3 4 r8                f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
    66.  1 2 3 4 r8                f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
    67.  1 2 3 4 r8----> end do
    68.  1 2 3 4-------> end do
    69.  1 2 3---------> end do
    70.  1 2-----------> end do

Poor loop order results in poor striding:
   The inner-most loop strides on a slow dimension of each array.
   The best the compiler can do is unroll.
   Little to no cache reuse.
  • 201.
USER / #1.OriginalLoops
-----------------------------------------------------------------
  Time%                                        55.0%
  Time                                     13.938244 secs
  Imb.Time                                  0.075369 secs
  Imb.Time%                                      0.6%
  Calls                       0.1 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED   11.858M/sec    165279602 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                      11.931M/sec    166291054 fills
  PAPI_L1_DCM                23.499M/sec    327533338 misses
  PAPI_L1_DCA                34.635M/sec    482751044 refs
  User time (approx)         13.938 secs  36239439807 cycles  100.0%Time
  Average Time per Call                    13.938244 sec
  CrayPat Overhead : Time          0.0%
  D1 cache hit,miss ratios      32.2% hits       67.8% misses
  D2 cache hit,miss ratio       49.8% hits       50.2% misses
  D1+D2 cache hit,miss ratio    66.0% hits       34.0% misses

Poor loop order results in poor cache reuse: for every L1 cache hit there are
two misses, and overall only 2/3 of all references were found in level 1 or
level 2 cache.
  • 202.
  • 203.
  • 204.
    75.  1 2-----------<   do i = ijmin, ijmax
    76.  1 2                 jj = 0
    77.  1 2 3---------<     do a = abmin, abmax
    78.  1 2 3 4-------<       do j = ijmin, ijmax
    79.  1 2 3 4                 jj = jj+1
    80.  1 2 3 4                 ii = 0
    81.  1 2 3 4 Vcr2--<         do b = abmin, abmax
    82.  1 2 3 4 Vcr2              ii = ii+1
    83.  1 2 3 4 Vcr2              f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
    84.  1 2 3 4 Vcr2              f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
    85.  1 2 3 4 Vcr2              f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
    86.  1 2 3 4 Vcr2              f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
    87.  1 2 3 4 Vcr2--> end do
    88.  1 2 3 4-------> end do
    89.  1 2 3---------> end do
    90.  1 2-----------> end do

Reordered loop nest:
   Now the inner-most loop is stride-1 on both arrays.
   Memory accesses now happen along the cache line, allowing reuse.
   The compiler is able to vectorize and make better use of SSE instructions.
  • 205.
USER / #2.ReorderedLoops
-----------------------------------------------------------------
  Time%                                        31.4%
  Time                                      7.955379 secs
  Imb.Time                                  0.260492 secs
  Imb.Time%                                      3.8%
  Calls                       0.1 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED    0.419M/sec      3331289 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                      15.285M/sec    121598284 fills
  PAPI_L1_DCM                13.330M/sec    106046801 misses
  PAPI_L1_DCA                66.226M/sec    526855581 refs
  User time (approx)          7.955 secs  20684020425 cycles  100.0%Time
  Average Time per Call                     7.955379 sec
  CrayPat Overhead : Time          0.0%
  D1 cache hit,miss ratios      79.9% hits       20.1% misses
  D2 cache hit,miss ratio        2.7% hits       97.3% misses
  D1+D2 cache hit,miss ratio    80.4% hits       19.6% misses

Improved striding greatly improved cache reuse: the runtime was cut nearly in
half. Still, some 20% of all references are cache misses.
  • 206.
First loop, partially vectorized and unrolled by 4:

     95.  1                 ii = 0
     96.  1 2-----------<   do j = ijmin, ijmax
     97.  1 2 i---------<     do b = abmin, abmax
     98.  1 2 i                 ii = ii+1
     99.  1 2 i                 jj = 0
    100.  1 2 i i-------<       do i = ijmin, ijmax
    101.  1 2 i i Vpr4--<         do a = abmin, abmax
    102.  1 2 i i Vpr4              jj = jj+1
    103.  1 2 i i Vpr4              f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
    104.  1 2 i i Vpr4              f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
    105.  1 2 i i Vpr4--> end do
    106.  1 2 i i-------> end do
    107.  1 2 i---------> end do
    108.  1 2-----------> end do

Second loop, vectorized and unrolled by 4:

    109.  1                 jj = 0
    110.  1 2-----------<   do i = ijmin, ijmax
    111.  1 2 3---------<     do a = abmin, abmax
    112.  1 2 3                 jj = jj+1
    113.  1 2 3                 ii = 0
    114.  1 2 3 4-------<       do j = ijmin, ijmax
    115.  1 2 3 4 Vr4---<         do b = abmin, abmax
    116.  1 2 3 4 Vr4               ii = ii+1
    117.  1 2 3 4 Vr4               f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
    118.  1 2 3 4 Vr4               f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
    119.  1 2 3 4 Vr4---> end do
    120.  1 2 3 4-------> end do
    121.  1 2 3---------> end do
    122.  1 2-----------> end do
  • 207.
USER / #3.FissionedLoops
-----------------------------------------------------------------
  Time%                                         9.8%
  Time                                      2.481636 secs
  Imb.Time                                  0.045475 secs
  Imb.Time%                                      2.1%
  Calls                       0.4 /sec           1.0 calls
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED    1.175M/sec      2916610 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                      34.109M/sec     84646518 fills
  PAPI_L1_DCM                26.424M/sec     65575972 misses
  PAPI_L1_DCA               156.705M/sec    388885686 refs
  User time (approx)          2.482 secs   6452279320 cycles  100.0%Time
  Average Time per Call                     2.481636 sec
  CrayPat Overhead : Time          0.0%
  D1 cache hit,miss ratios      83.1% hits       16.9% misses
  D2 cache hit,miss ratio        3.3% hits       96.7% misses
  D1+D2 cache hit,miss ratio    83.7% hits       16.3% misses

Fissioning further improved cache reuse and resulted in better vectorization:
the runtime was further reduced, the cache hit/miss ratio improved slightly,
and the loopmark file shows better vectorization of the fissioned loops.
  • 208.
  • 209.
A triple-nested loop at a high level; IFs inside the inner loop can significantly reduce the chances of vectorization.

    (  52)  C     THE ORIGINAL
    (  53)
    (  54)        DO 47020 J = 1, JMAX
    (  55)        DO 47020 K = 1, KMAX
    (  56)        DO 47020 I = 1, IMAX
    (  57)        JP = J + 1
    (  58)        JR = J - 1
    (  59)        KP = K + 1
    (  60)        KR = K - 1
    (  61)        IP = I + 1
    (  62)        IR = I - 1
    (  63)        IF (J .EQ. 1)    GO TO 50
    (  64)        IF (J .EQ. JMAX) GO TO 51
    (  65)        XJ = ( A(I,JP,K) - A(I,JR,K) ) * DA2
    (  66)        YJ = ( B(I,JP,K) - B(I,JR,K) ) * DA2
    (  67)        ZJ = ( C(I,JP,K) - C(I,JR,K) ) * DA2
    (  68)        GO TO 70
    (  69)   50   J1 = J + 1
    (  70)        J2 = J + 2
    (  71)        XJ = (-3. * A(I,J,K) + 4. * A(I,J1,K) - A(I,J2,K) ) * DA2
    (  72)        YJ = (-3. * B(I,J,K) + 4. * B(I,J1,K) - B(I,J2,K) ) * DA2
    (  73)        ZJ = (-3. * C(I,J,K) + 4. * C(I,J1,K) - C(I,J2,K) ) * DA2
    (  74)        GO TO 70
    (  75)   51   J1 = J - 1
    (  76)        J2 = J - 2
    (  77)        XJ = ( 3. * A(I,J,K) - 4. * A(I,J1,K) + A(I,J2,K) ) * DA2
    (  78)        YJ = ( 3. * B(I,J,K) - 4. * B(I,J1,K) + B(I,J2,K) ) * DA2
    (  79)        ZJ = ( 3. * C(I,J,K) - 4. * C(I,J1,K) + C(I,J2,K) ) * DA2
    (  80)   70   CONTINUE
    (  81)        IF (K .EQ. 1)    GO TO 52
    (  82)        IF (K .EQ. KMAX) GO TO 53
    (  83)        XK = ( A(I,J,KP) - A(I,J,KR) ) * DB2
    (  84)        YK = ( B(I,J,KP) - B(I,J,KR) ) * DB2
    (  85)        ZK = ( C(I,J,KP) - C(I,J,KR) ) * DB2
    (  86)        GO TO 71

continues...
  • 210.
PGI:
   55, Invariant if transformation
       Loop not vectorized: loop count too small
   56, Invariant if transformation
  • 211.
The stride-1 loop is brought inside the IF statements.

    ( 141)  C     THE RESTRUCTURED
    ( 142)
    ( 143)        DO 47029 J = 1, JMAX
    ( 144)        DO 47029 K = 1, KMAX
    ( 145)
    ( 146)        IF (J .EQ. 1) THEN
    ( 147)
    ( 148)        J1 = 2
    ( 149)        J2 = 3
    ( 150)        DO 47021 I = 1, IMAX
    ( 151)        VAJ(I) = (-3. * A(I,J,K) + 4. * A(I,J1,K) - A(I,J2,K) ) * DA2
    ( 152)        VBJ(I) = (-3. * B(I,J,K) + 4. * B(I,J1,K) - B(I,J2,K) ) * DA2
    ( 153)        VCJ(I) = (-3. * C(I,J,K) + 4. * C(I,J1,K) - C(I,J2,K) ) * DA2
    ( 154)  47021 CONTINUE
    ( 155)
    ( 156)        ELSE IF (J .NE. JMAX) THEN
    ( 157)
    ( 158)        JP = J+1
    ( 159)        JR = J-1
    ( 160)        DO 47022 I = 1, IMAX
    ( 161)        VAJ(I) = ( A(I,JP,K) - A(I,JR,K) ) * DA2
    ( 162)        VBJ(I) = ( B(I,JP,K) - B(I,JR,K) ) * DA2
    ( 163)        VCJ(I) = ( C(I,JP,K) - C(I,JR,K) ) * DA2
    ( 164)  47022 CONTINUE
    ( 165)
    ( 166)        ELSE
    ( 167)
    ( 168)        J1 = JMAX-1
    ( 169)        J2 = JMAX-2
    ( 170)        DO 47023 I = 1, IMAX
    ( 171)        VAJ(I) = ( 3. * A(I,J,K) - 4. * A(I,J1,K) + A(I,J2,K) ) * DA2
    ( 172)        VBJ(I) = ( 3. * B(I,J,K) - 4. * B(I,J1,K) + B(I,J2,K) ) * DA2
    ( 173)        VCJ(I) = ( 3. * C(I,J,K) - 4. * C(I,J1,K) + C(I,J2,K) ) * DA2
    ( 174)  47023 CONTINUE
    ( 175)
    ( 176)        ENDIF

continues...
  • 212.
PGI:
   144, Invariant if transformation
        Loop not vectorized: loop count too small
   150, Generated 3 alternate loops for the inner loop
        Generated vector sse code for inner loop
        Generated 8 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 8 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 8 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 8 prefetch instructions for this loop
   160, Generated 4 alternate loops for the inner loop
        Generated vector sse code for inner loop
        Generated 6 prefetch instructions for this loop
        Generated vector sse code for inner loop
        ...
  • 213.
[Chart] MFLOPS (0-2500) versus vector length (0-500) for four variants: CCE-Original Fortran, CCE-Restructured Fortran, PGI-Original Fortran, PGI-Restructured Fortran.
  • 214.
Max vector length doubled to 256 bits
Much cleaner instruction set
   The result register is distinct from the source registers
   The old SSE instruction set always destroyed a source register
Floating-point multiply-accumulate
   A(1:4) = B(1:4)*C(1:4) + D(1:4)   ! now one instruction
The next generation of both AMD and Intel processors will have AVX
Vectors are becoming more important, not less
 Cache blocking is a combination of strip mining and loop interchange, designed to increase data reuse
    Takes advantage of temporal reuse: re-reference array elements already referenced
    Good blocking will also take advantage of spatial reuse: work with the cache lines!
 Many ways to block any given loop nest
    Which loops get blocked?
    What block size(s) to use?
    Analysis can reveal which ways are beneficial
    But trial-and-error is probably faster
 2D Laplacian example:

      do j = 1, 8
        do i = 1, 16
          a = u(i-1,j) + u(i+1,j) &
              - 4*u(i,j)          &
              + u(i,j-1) + u(i,j+1)
        end do
      end do

 Cache structure for this example:
    Each cache line holds 4 array elements
    The cache can hold 12 lines of u data
 There is no cache reuse between outer-loop iterations

   [Figure: 16 x 8 grid of u(i,j), i = 1..16 down and j = 1..8 across, showing how cache lines map onto the array; the unblocked loop takes 120 cache misses.]
 Unblocked loop: 120 cache misses
 Block the inner loop:

      do IBLOCK = 1, 16, 4
        do j = 1, 8
          do i = IBLOCK, IBLOCK + 3
            a(i,j) = u(i-1,j) + u(i+1,j) &
                     - 2*u(i,j)          &
                     + u(i,j-1) + u(i,j+1)
          end do
        end do
      end do

 Now we have reuse of the "j+1" data

   [Figure: the same 16 x 8 grid traversed in 4-row strips (i = 1, 5, 9, 13); the miss count drops to 80.]
 One-dimensional blocking reduced misses from 120 to 80
 Iterate over 4 x 4 blocks:

      do JBLOCK = 1, 8, 4
        do IBLOCK = 1, 16, 4
          do j = JBLOCK, JBLOCK + 3
            do i = IBLOCK, IBLOCK + 3
              a(i,j) = u(i-1,j) + u(i+1,j) &
                       - 2*u(i,j)          &
                       + u(i,j-1) + u(i,j+1)
            end do
          end do
        end do
      end do

 Better use of spatial locality (cache lines)

   [Figure: the grid traversed in 4 x 4 blocks (j = 1..5 across, i = 1..13 down); the miss count drops to 60.]
 Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
    Operations can be arranged to create multiple levels of blocking:
       Block for registers
       Block for cache (L1, L2, L3)
       Block for TLB
 No further discussion here beyond the single-level sketch below. Interested readers can see
    Any book on code optimization
       Sun's "Techniques for Optimizing Applications: High Performance Computing" contains a decent introductory discussion in Chapter 8
       Insert your favorite book here
    Gunnels, Henry, and van de Geijn. June 2001. High-Performance Matrix Multiplication Algorithms for Architectures with Hierarchical Memories. FLAME Working Note #4, TR-2001-22, The University of Texas at Austin, Department of Computer Sciences
       Develops algorithms and cost models for GEMM in hierarchical memories
    Goto and van de Geijn. 2008. Anatomy of High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software 34, 3 (May), 1-25
       Describes the GotoBLAS DGEMM
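To make the idea concrete, here is a minimal single-level cache-blocked GEMM sketch in C. It is an illustration only, not the register/TLB-blocked algorithms developed in the references above; the matrix order N and block size BS are assumed values that would need tuning for a real cache hierarchy.

    #include <stddef.h>

    #define N  1024   /* assumed matrix order: C[i][j] += A[i][k] * B[k][j] */
    #define BS 64     /* assumed cache block size; tune per target cache level */

    static int imin(int a, int b) { return a < b ? a : b; }

    /* One level of cache blocking for C = C + A*B, row-major square matrices. */
    void dgemm_blocked(const double *a, const double *b, double *c)
    {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    /* Work on BS x BS blocks that (ideally) stay resident in cache. */
                    for (int i = ii; i < imin(ii + BS, N); i++)
                        for (int k = kk; k < imin(kk + BS, N); k++) {
                            double aik = a[(size_t)i * N + k];
                            for (int j = jj; j < imin(jj + BS, N); j++)
                                c[(size_t)i * N + j] += aik * b[(size_t)k * N + j];
                        }
    }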
 "I tried cache-blocking my code, but it didn't help"
    You're doing it wrong
       Your block size is too small (too much loop overhead)
       Your block size is too big (data is falling out of cache)
       You're targeting the wrong cache level (?)
       You haven't selected the correct subset of loops to block
    The compiler is already blocking that loop
    Prefetching is acting to minimize cache misses
    Computational intensity within the loop nest is very large, making blocking less important
 Multigrid PDE solver
    Class D, 64 MPI ranks
    Global grid is 1024 x 1024 x 1024
    Local grid is 258 x 258 x 258
 Two similar loop nests account for >50% of run time
 27-point 3D stencil:

      do i3 = 2, 257
        do i2 = 2, 257
          do i1 = 2, 257
            ! update u(i1,i2,i3)
            ! using 27-point stencil
          end do
        end do
      end do

 There is good data reuse along the leading dimension, even without blocking

   [Figure: the 27-point stencil spanning i1-1..i1+1, i2-1..i2+1, i3-1..i3+1, with i1 along the cache lines.]
 Block the inner two loops
    Creates blocks extending along the i3 direction

      do I2BLOCK = 2, 257, BS2
        do I1BLOCK = 2, 257, BS1
          do i3 = 2, 257
            do i2 = I2BLOCK, min(I2BLOCK+BS2-1, 257)
              do i1 = I1BLOCK, min(I1BLOCK+BS1-1, 257)
                ! update u(i1,i2,i3)
                ! using 27-point stencil
              end do
            end do
          end do
        end do
      end do

      Block size    Mop/s/process
      unblocked            531.50
      16 x 16              279.89
      22 x 22              321.26
      28 x 28              358.96
      34 x 34              385.33
      40 x 40              408.53
      46 x 46              443.94
      52 x 52              468.58
      58 x 58              470.32
      64 x 64              512.03
      70 x 70              506.92
 Block the outer two loops
    Preserves spatial locality along the i1 direction

      do I3BLOCK = 2, 257, BS3
        do I2BLOCK = 2, 257, BS2
          do i3 = I3BLOCK, min(I3BLOCK+BS3-1, 257)
            do i2 = I2BLOCK, min(I2BLOCK+BS2-1, 257)
              do i1 = 2, 257
                ! update u(i1,i2,i3)
                ! using 27-point stencil
              end do
            end do
          end do
        end do
      end do

      Block size    Mop/s/process
      unblocked            531.50
      16 x 16              674.76
      22 x 22              680.16
      28 x 28              688.64
      34 x 34              683.84
      40 x 40              698.47
      46 x 46              689.14
      52 x 52              706.62
      58 x 58              692.57
      64 x 64              703.40
      70 x 70              693.87
The C version with ordinary pointers:

    ( 53) void mat_mul_daxpy(double *a, double *b, double *c,
                             int rowa, int cola, int colb)
    ( 54) {
    ( 55)   int i, j, k;           /* loop counters */
    ( 56)   int rowc, colc, rowb;  /* sizes not passed as arguments */
    ( 57)   double con;            /* constant value */
    ( 58)
    ( 59)   rowb = cola;
    ( 60)   rowc = rowa;
    ( 61)   colc = colb;
    ( 62)
    ( 63)   for(i=0;i<rowc;i++) {
    ( 64)     for(k=0;k<cola;k++) {
    ( 65)       con = *(a + i*cola +k);
    ( 66)       for(j=0;j<colc;j++) {
    ( 67)         *(c + i*colc + j) += con * *(b + k*colb + j);
    ( 68)       }
    ( 69)     }
    ( 70)   }
    ( 71) }

C pointers do not carry the same rules as Fortran arrays: the compiler has no way to know whether *a, *b, and *c overlap or are referenced differently elsewhere, so it must assume the worst. The result is a false data dependency:

    mat_mul_daxpy:
      66, Loop not vectorized: data dependency
          Loop not vectorized: data dependency
          Loop unrolled 4 times
The same routine with restricted pointers:

    ( 53) void mat_mul_daxpy(double* restrict a, double* restrict b,
                             double* restrict c, int rowa, int cola, int colb)
    ( 54) {
    ( 55)   int i, j, k;           /* loop counters */
    ( 56)   int rowc, colc, rowb;  /* sizes not passed as arguments */
    ( 57)   double con;            /* constant value */
    ( 58)
    ( 59)   rowb = cola;
    ( 60)   rowc = rowa;
    ( 61)   colc = colb;
    ( 62)
    ( 63)   for(i=0;i<rowc;i++) {
    ( 64)     for(k=0;k<cola;k++) {
    ( 65)       con = *(a + i*cola +k);
    ( 66)       for(j=0;j<colc;j++) {
    ( 67)         *(c + i*colc + j) += con * *(b + k*colb + j);
    ( 68)       }
    ( 69)     }
    ( 70)   }
    ( 71) }

C99 introduces the restrict keyword, which allows the programmer to promise that the memory referenced through a pointer is not referenced through any other pointer. If you declare a restricted pointer and break that promise, the behavior is undefined by the standard.
With restrict, the compiler now vectorizes the loop at line 66:

    66, Generated alternate loop with no peeling - executed if loop count <= 24
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated alternate loop with no peeling and more aligned moves -
          executed if loop count <= 24 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated alternate loop with more aligned moves - executed if
          loop count >= 25 and alignment test is passed
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop

 The same effect can also be achieved with the PGI "safe" pragma or the -Msafeptr compiler option, or the PathScale -OPT:alias option
 GNU malloc library
    malloc, calloc, realloc, and free calls
    Fortran dynamic variables
 Malloc library system calls
    mmap, munmap => used for larger allocations
    brk, sbrk    => increase/decrease the heap
 The malloc library is optimized for low system memory use
    This can result in extra system calls and minor page faults
 Detecting "bad" malloc behavior
    Profile data => "excessive system time"
 Correcting "bad" malloc behavior
    Eliminate mmap use by malloc
    Increase the threshold for releasing heap memory
 Use environment variables to alter malloc (a mallopt() sketch with the same effect follows below)
    MALLOC_MMAP_MAX_ = 0
    MALLOC_TRIM_THRESHOLD_ = 536870912
 Possible downsides
    Heap fragmentation
    The user process may call mmap directly
    The user process may launch other processes
 PGI's -Msmartalloc option does something similar for you at compile time
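The same behavior can also be requested from inside the program through glibc's mallopt() interface, which these environment variables correspond to. A minimal sketch, assuming a glibc-based malloc; the 512 MB trim threshold mirrors the value above.

    #include <malloc.h>   /* glibc mallopt(), M_MMAP_MAX, M_TRIM_THRESHOLD */

    int main(void)
    {
        /* Equivalent to MALLOC_MMAP_MAX_=0: never satisfy malloc() with mmap(). */
        mallopt(M_MMAP_MAX, 0);

        /* Equivalent to MALLOC_TRIM_THRESHOLD_=536870912: keep up to 512 MB of
           freed memory on the heap instead of returning it to the kernel.      */
        mallopt(M_TRIM_THRESHOLD, 512 * 1024 * 1024);

        /* ... rest of the application ... */
        return 0;
    }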
 Google created a replacement "malloc" library
    "Minimal" TCMalloc replaces GNU malloc
 Limited testing indicates TCMalloc is as good as or better than GNU malloc
    Environment variables are not required
 TCMalloc is almost certainly better for allocations made inside OpenMP parallel regions
 There is currently no pre-built tcmalloc for the Cray XT/XE, but some users have successfully built it
 Linux has a "first touch" policy for memory allocation
    *alloc functions don't actually allocate your memory
    Memory gets allocated when it is first "touched"
 Problem: a code can allocate more memory than is available
    Linux assumes there is swap space; we don't have any
    Applications won't fail from over-allocation until the memory is finally touched
 Problem: memory is placed local to the core of the "touching" thread
    Only a problem if thread 0 allocates all the memory for a node
 Solution: always initialize your memory immediately after allocating it
    If you over-allocate, it fails immediately rather than at a strange place in your code
    If every thread touches its own memory, it is allocated on the proper socket
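A minimal OpenMP sketch of the "initialize it where you will use it" advice; the array size is an arbitrary assumption, and the key point is that the initialization loop uses the same static schedule as the later compute loops.

    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        size_t n = 100 * 1024 * 1024;            /* assumed problem size          */
        double *x = malloc(n * sizeof(double));  /* no pages are resident yet     */
        if (!x) return 1;                        /* allocation failure caught now */

        /* First touch: each thread initializes the pages it will later work on,
           so those pages land in the memory attached to that thread's socket.
           Any over-allocation also fails here, not deep inside the run.         */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            x[i] = 0.0;

        /* ... later parallel loops should keep the same schedule(static) so each
           thread continues to work on the pages it touched first ...            */
        free(x);
        return 0;
    }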
This may help both compute and communication.
 Opterons support 4K, 2M, and 1G pages
    We do not support 1G pages
    4K pages are used by default
 2M (huge) pages are more difficult to use, but...
    Your code may run with fewer TLB misses (hence faster)
    The TLB can address more physical memory with 2M pages than with 4K pages
    The Gemini interconnect performs better with 2M pages than with 4K pages
    2M pages use fewer Gemini resources than 4K pages
 Link the hugetlbfs library into your code: -lhugetlbfs
 Set the HUGETLB_MORECORE environment variable in your run script
    Example:  export HUGETLB_MORECORE=yes
 Use the aprun option -m###h to ask for ### MB of huge pages
    Example:  aprun -m500h   (request 500 MB of huge pages as available; use 4K pages thereafter)
    Example:  aprun -m500hs  (request 500 MB of huge pages; if they are not available, terminate the launch)
 Note: if not enough huge pages are available, the cost of filling the remainder with 4K pages may degrade performance
 Short Message Eager Protocol
    The sending rank "pushes" the message to the receiving rank
    Used for messages of MPICH_MAX_SHORT_MSG_SIZE bytes or less
    The sender assumes the receiver can handle the message:
       a matching receive is posted, or
       the receiver has available event queue entries (MPICH_PTL_UNEX_EVENTS) and buffer space (MPICH_UNEX_BUFFER_SIZE) to store the message
 Long Message Rendezvous Protocol
    Messages are "pulled" by the receiving rank
    Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
    The sender sends a small header packet with the information the receiver needs to pull over the data
    Data is sent only after the matching receive is posted by the receiving rank
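One practical consequence: when receives are posted before the matching sends arrive, eager messages land directly in the application buffer instead of the unexpected buffers, which reduces the pressure on MPICH_UNEX_BUFFER_SIZE and MPICH_PTL_UNEX_EVENTS. A minimal sketch of the pre-posting pattern; the neighbor ranks, tag, and counts are placeholders.

    #include <mpi.h>

    void exchange(double *recvbuf, double *sendbuf, int count,
                  int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];
        MPI_Status  stat[2];
        const int tag = 99;                  /* arbitrary tag for this example */

        /* Post the receives first so incoming eager data has a matched buffer. */
        MPI_Irecv(recvbuf,         count, MPI_DOUBLE, left,  tag, comm, &req[0]);
        MPI_Irecv(recvbuf + count, count, MPI_DOUBLE, right, tag, comm, &req[1]);

        /* Now send; small messages go out eagerly, large ones use rendezvous. */
        MPI_Send(sendbuf,         count, MPI_DOUBLE, right, tag, comm);
        MPI_Send(sendbuf + count, count, MPI_DOUBLE, left,  tag, comm);

        MPI_Waitall(2, req, stat);
    }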
[Figure: MPT eager protocol on SeaStar when the receive is pre-posted. Step 1: the MPI_RECV call posts an application match entry (ME) to Portals. Step 2: the sender calls MPI_SEND. Step 3: a Portals DMA PUT delivers the data directly into the application buffer. The unexpected buffers (MPICH_UNEX_BUFFER_SIZE), unexpected event queue (MPICH_PTL_UNEX_EVENTS), and other event queue (MPICH_PTL_OTHER_EVENTS) are shown but not needed when MPI_RECV is posted prior to the MPI_SEND call.]
[Figure: MPT eager protocol on SeaStar when the receive is NOT pre-posted. Data is "pushed" to the receiver (MPICH_MAX_SHORT_MSG_SIZE bytes or less). Step 1: the sender calls MPI_SEND; no Portals match entry exists yet on the receiver. Step 2: a Portals DMA PUT lands the data in the unexpected buffers (MPICH_UNEX_BUFFER_SIZE) and an entry is queued in the unexpected event queue (MPICH_PTL_UNEX_EVENTS). Step 3: the receiver calls MPI_RECV. Step 4: the data is copied (memcpy) from the unexpected buffer into the application buffer.]
[Figure: MPT long-message rendezvous protocol. Step 1: the MPI_SEND call creates a Portals match entry. Step 2: a Portals DMA PUT sends only a small header to the receiver. Step 3: the MPI_RECV call triggers a GET request. Step 4: the receiver issues the GET request matching the sender's ME. Step 5: a Portals DMA transfers the data. Data is not sent until MPI_RECV is issued.]
 The default rank ordering can be changed using the environment variable MPICH_RANK_REORDER_METHOD
 It accepts the following values:
    0: Round-robin placement - sequential ranks are placed on the next node in the list; placement starts over with the first node upon reaching the end of the list
    1: SMP-style placement (the default) - sequential ranks fill up each node before moving to the next
    2: Folded-rank placement - similar to round-robin, except that each pass over the node list is in the opposite direction of the previous pass
    3: Custom ordering - the ordering is specified in a file named MPICH_RANK_ORDER
 When is this useful?
    Point-to-point communication consumes a significant fraction of program time and a load imbalance has been detected
    It has also been shown to help collectives (alltoall) on subcommunicators (GYRO)
    Spreading I/O out across nodes (POP)
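For illustration only: a hypothetical MPICH_RANK_ORDER file for a 4 x 4 rank grid placed 4 ranks per node, grouping each 2 x 2 block of neighboring ranks onto one node. The comma-separated format mirrors the grid_order output shown later; treat the exact layout rules as an assumption and check the intro_mpi man page on your system.

    # hypothetical MPICH_RANK_ORDER: 16 ranks, 4 ranks per node,
    # one 2x2 block of neighboring ranks per node
    0,1,4,5
    2,3,6,7
    8,9,12,13
    10,11,14,15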
 One can also use the CrayPat performance measurement tools to generate a suggested custom ordering
    Available if MPI functions are traced (-g mpi or -O apa)
       pat_build -O apa my_program
       See the Examples section of the pat_build man page
 pat_report options:
    mpi_sm_rank_order
       Uses message data from tracing MPI to generate a suggested MPI rank order. Requires the program to be instrumented using the pat_build -g mpi option.
    mpi_rank_order
       Uses time in user functions, or alternatively any other metric specified by using the -s mro_metric options, to generate a suggested MPI rank order.
 module load xt-craypat
 Rebuild your code
 pat_build -O apa a.out
 Run a.out+pat
 pat_report -Ompi_sm_rank_order a.out+pat+...sdt/ > pat.report
    This creates an MPICH_RANK_ORDER.x file
 Then set the environment variable MPICH_RANK_REORDER_METHOD=3, AND
 Link (or copy) the file MPICH_RANK_ORDER.x to MPICH_RANK_ORDER
 Rerun the code
Table 1: Suggested MPI Rank Order
Eight cores per node: USER Samp per node

    Rank    Max        Max/    Avg        Avg/    Max Node
    Order   USER Samp  SMP     USER Samp  SMP     Ranks
    d           17062   97.6%      16907  100.0%  832,328,820,797,113,478,898,600
    2           17213   98.4%      16907  100.0%  53,202,309,458,565,714,821,970
    0           17282   98.8%      16907  100.0%  53,181,309,437,565,693,821,949
    1           17489  100.0%      16907  100.0%  0,1,2,3,4,5,6,7

 This suggests that
   1. the custom ordering "d" might be the best
   2. folded-rank (2) is next best
   3. round-robin (0) is third best
   4. the default ordering (1) is last
 GYRO 8.0
    B3-GTC problem with 1024 processes
 Run with alternate MPI orderings
    Custom: profiled with -O apa and used the reordering file MPICH_RANK_ORDER.d

      Reorder method          Comm. time
      Default                    11.26 s
      0 - round-robin             6.94 s    <- CrayPat suggestion
      2 - folded-rank             6.68 s       almost right!
      d - custom from apa         8.03 s
 TGYRO 1.0
    Steady-state turbulent transport code using the GYRO, NEO, and TGLF components
 ASTRA test case
    Tested MPI orderings at large scale
    Originally testing weak scaling, but found reordering very useful

      Reorder           TGYRO wall time (min)
      method           20480    40960    81920  PEs
      Default             99      104      105
      Round-robin         66       63       72       <- Huge win!
 Time %  |        Time |  Imb. Time |   Imb.   |       Calls | Experiment=1
         |             |            |   Time % |             | Group
         |             |            |          |             |  Function
         |             |            |          |             |   PE='HIDE'

  100.0% | 1530.892958 |         -- |       -- | 27414118.0  | Total
 |---------------------------------------------------------------------------
 |  52.0% |  796.046937 |         -- |       -- | 22403802.0 | USER
 ||--------------------------------------------------------------------------
 ||  22.3% |  341.176468 |   3.482338 |     1.0% | 19200000.0 | getrates_
 ||  17.4% |  266.542501 |  35.451437 |    11.7% |     1200.0 | rhsf_
 ||   5.1% |   78.772615 |   0.532703 |     0.7% |  3200000.0 | mcavis_new_looptool_
 ||   2.6% |   40.477488 |   2.889609 |     6.7% |     1200.0 | diffflux_proc_looptool_
 ||   2.1% |   31.666938 |   6.785575 |    17.6% |      200.0 | integrate_erk_jstage_lt_
 ||   1.4% |   21.318895 |   5.042270 |    19.1% |     1200.0 | computeheatflux_looptool_
 ||   1.1% |   16.091956 |   6.863891 |    29.9% |        1.0 | main
 ||==========================================================================
 |  47.4% |  725.049709 |         -- |       -- |  5006632.0 | MPI
 ||--------------------------------------------------------------------------
 ||  43.8% |  670.742304 |  83.143600 |    11.0% |  2389440.0 | mpi_wait_
 ||   1.9% |   28.821882 | 281.694997 |    90.7% |  1284320.0 | mpi_isend_
 |============================================================================
 Time %  |        Time |  Imb. Time |   Imb.   |       Calls | Experiment=1
         |             |            |   Time % |             | Group
         |             |            |          |             |  Function
         |             |            |          |             |   PE='HIDE'

  100.0% | 1730.555208 |         -- |       -- | 16090113.8  | Total
 |---------------------------------------------------------------------------
 |  76.9% | 1330.111350 |         -- |       -- |  4882627.8 | MPI
 ||--------------------------------------------------------------------------
 ||  72.1% | 1247.436960 |  54.277263 |     4.2% |  2389440.0 | mpi_wait_
 ||   1.3% |   22.712017 | 101.212360 |    81.7% |  1234718.3 | mpi_isend_
 ||   1.0% |   17.623757 |   4.642004 |    20.9% |        1.0 | mpi_comm_dup_
 ||   1.0% |   16.849281 |  71.805979 |    81.0% |  1234718.3 | mpi_irecv_
 ||   1.0% |   16.835691 | 192.820387 |    92.0% |    19999.2 | mpi_waitall_
 ||==========================================================================
 |  22.2% |  384.978417 |         -- |       -- | 11203802.0 | USER
 ||--------------------------------------------------------------------------
 ||   9.9% |  171.440025 |   1.929439 |     1.1% |  9600000.0 | getrates_
 ||   7.7% |  133.599580 |  19.572807 |    12.8% |     1200.0 | rhsf_
 ||   2.3% |   39.465572 |   0.600168 |     1.5% |  1600000.0 | mcavis_new_looptool_
 |============================================================================
[Figure: nearest-neighbor exchanges for the 3D stencil. Differencing in the X direction communicates with MPI tasks K-1 and K+1; differencing in the Y and Z directions communicates with tasks K-30/K+30 and K-1200/K+1200.]
 The code must perform one communication across each surface of a cube
    12 cubes perform 72 communications, 63 of which go "off node"
 With an optimized mapping of the MPI tasks onto the node, the code still performs 72 communications, but now only 32 are off node
Rank Reordering Case Study
 Application data is in a 3D space, X x Y x Z
 Communication is nearest-neighbor
 The default ordering results in a 12 x 1 x 1 block of ranks on each node
 A custom reordering is generated: 3 x 2 x 2 blocks per node, resulting in more on-node communication
    % pat_report -Ompi_sm_rank_order -s rank_grid_dim=8,6 ...

    Notes for table 1:
      To maximize the locality of point to point communication, specify a
      Rank Order with small Max and Avg Sent Msg Total Bytes per node for
      the target number of cores per node.
      To specify a Rank Order with a numerical value, set the environment
      variable MPICH_RANK_REORDER_METHOD to the given value.
      To specify a Rank Order with a letter value 'x', set the environment
      variable MPICH_RANK_REORDER_METHOD to 3, and copy or link the file
      MPICH_RANK_ORDER.x to MPICH_RANK_ORDER.

    Table 1: Sent Message Stats and Suggested MPI Rank Order

      Communication Partner Counts
      Number of
      Partners   Count   Ranks
             2       4   0 5 42 47
             3      20   1 2 3 4 ...
             4      24   7 8 9 10 ...
Four cores per node: Sent Msg Total Bytes per node

    Rank    Max           Max/     Avg           Avg/     Max Node
    Order   Total Bytes   SMP      Total Bytes   SMP      Ranks
    g         121651200    73.9%      86400000    62.5%   14,20,15,21
    h         121651200    73.9%      86400000    62.5%   14,20,21,15
    u         152064000    92.4%     146534400   106.0%   13,12,10,4
    1         164505600   100.0%     138240000   100.0%   16,17,18,19
    d         164505600   100.0%     142387200   103.0%   16,17,19,18
    0         224640000   136.6%     207360000   150.0%   1,13,25,37
    2         241920000   147.1%     207360000   150.0%   7,16,31,40
    % $CRAYPAT_ROOT/sbin/grid_order -c2,2 -g 8,6
    # grid_order -c 2,2 -g 8,6
    # Region 0: 0,0 (0..47)
    0,1,6,7
    2,3,8,9
    4,5,10,11
    12,13,18,19
    14,15,20,21
    16,17,22,23
    24,25,30,31
    26,27,32,33
    28,29,34,35
    36,37,42,43
    38,39,44,45
    40,41,46,47

This script will also handle the case where the cells do not evenly partition the grid.
      X X o o
      X X o o
      o o o o
      o o o o

 Nodes marked X heavily use a shared resource
    If it is memory bandwidth, scatter the X's
    If it is network bandwidth to the other nodes, again scatter them
    If it is network bandwidth among themselves, concentrate them
    call mpi_send(a, 10, ...)        ! Each message incurs latency and
    call mpi_send(b, 10, ...)        ! library overhead
    call mpi_send(c, 10, ...)
    call mpi_send(d, 10, ...)

 Copy the messages into a contiguous buffer and send once:

    sendbuf( 1:10) = a(1:10)
    sendbuf(11:20) = b(1:10)
    sendbuf(21:30) = c(1:10)
    sendbuf(31:40) = d(1:10)
    call mpi_send(sendbuf, 40, ...)  ! Latency and library overhead
                                     ! incurred only once

 The effectiveness of this optimization is machine dependent
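The same aggregation can also be expressed with MPI_Pack instead of hand-copying, sketched here in C for the four 10-element arrays above; whether packing beats a plain copy into a contiguous buffer is, again, machine dependent.

    #include <mpi.h>

    void send_aggregated(double *a, double *b, double *c, double *d,
                         int dest, MPI_Comm comm)
    {
        char buf[4 * 10 * sizeof(double)];   /* room for the four 10-element arrays */
        int  pos = 0;

        /* Pack the four small messages into one contiguous buffer... */
        MPI_Pack(a, 10, MPI_DOUBLE, buf, sizeof(buf), &pos, comm);
        MPI_Pack(b, 10, MPI_DOUBLE, buf, sizeof(buf), &pos, comm);
        MPI_Pack(c, 10, MPI_DOUBLE, buf, sizeof(buf), &pos, comm);
        MPI_Pack(d, 10, MPI_DOUBLE, buf, sizeof(buf), &pos, comm);

        /* ...and pay the latency and library overhead only once. */
        MPI_Send(buf, pos, MPI_PACKED, dest, 0, comm);
    }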
 Most collectives have been tuned to take advantage of algorithms and hardware to maximize performance
    MPI_ALLTOALL
       Reorders communications to spread traffic around the network efficiently
    MPI_BCAST / _REDUCE / _ALLREDUCE
       Use tree-based algorithms to reduce the number of messages
       Need to strike a balance between the width and depth of the tree
    MPI_GATHER
       Uses a tree algorithm to reduce resource contention and to aggregate messages
 You don't want to have to reinvent the wheel
 MPI_ALLTOALL
    Message size decreases as the number of ranks grows
    The number of messages is O(num_ranks^2)
    Very difficult to scale to very high core counts
 MPI_BCAST / _REDUCE / _ALLREDUCE / _BARRIER
    All are O(log(num_ranks))
    All represent global synchronization points
    They expose ANY load imbalance in the code
    They expose ANY "jitter" induced by the OS or other services
 MPI_GATHER
    Many-to-one
 The greater the frequency of collectives, the harder it will be to scale
 Filesystem
    Lustre, GPFS, and Panasas are "parallel filesystems"
    I/O operations are broken down into basic units and distributed to multiple endpoints
    Spreading out operations in this way can greatly improve performance at large processor counts
 Program
    Just as a problem gets partitioned across multiple processors, I/O operations can be done in parallel
    MPI-IO is a standard API for doing parallel I/O operations
    By performing I/O operations in parallel, an application can reduce I/O bottlenecks and take advantage of parallel filesystems
    HDF5, NetCDF, and ADIOS all provide parallel I/O in a portable file format
 To maximize I/O performance, parallel filesystems
    Break I/O operations into chunks, much like inodes on standard filesystems, which get distributed among I/O servers
    Provide a means of controlling how much concurrency to use for a given file
    Make the distributed nature of the data invisible to the program/programmer
    File metadata may be distributed (GPFS) or centralized (Lustre)
 In order to take advantage of a parallel filesystem, a user must
    Ensure that multiple processes share the I/O duties; one process is incapable of saturating the filesystem
    Prevent multiple processes from using the same "chunk" simultaneously (more important for writes)
    Choose a concurrency that is "distributed enough" without spreading the data too thin to be effective (ideally, one process shouldn't need to access several I/O servers)
 I/O is simply data migration: memory <-> disk
 I/O is a very expensive operation
    Interactions with data in memory and on disk
    Must get the kernel involved
 How is I/O performed?
    I/O pattern
       Number of processes and files
       File access characteristics
 Where is I/O performed?
    Characteristics of the computational system
    Characteristics of the file system
 There is no "one size fits all" solution to the I/O problem
    Many I/O patterns work well for some range of parameters
    Bottlenecks in performance can occur in many locations (application and/or file system)
    Going to extremes with an I/O pattern will typically lead to problems
 The best performance comes from situations where the data is accessed contiguously both in memory and on disk
    This facilitates large operations and minimizes latency
 Commonly, data access is contiguous in memory but noncontiguous on disk, or vice versa, usually in order to reconstruct a global data structure via parallel I/O

   [Figure: a contiguous memory-to-disk mapping versus a noncontiguous mapping between memory and disk.]
 Spokesperson pattern
    One process performs all I/O
    Requires data aggregation or duplication
    Limited by the single I/O process
    The pattern does not scale
       Time increases linearly with the amount of data
       Time increases with the number of processes
 File per process
    All processes perform I/O to individual files
    Limited by the file system
    The pattern does not scale at large process counts
       The number of files creates a bottleneck with metadata operations
       The number of simultaneous disk accesses creates contention for file system resources
 Shared file
    Each process performs I/O to a single file which is shared
    Performance
       The data layout within the shared file is very important
       At large process counts, contention can build for file system resources
 Subset of processes which perform I/O
    Aggregates the data of a group of processes
    Serializes I/O within the group
 The I/O processes may access independent files
    Limits the number of files accessed
 Or a group of processes performs parallel I/O to a shared file
    Increases the number of shared files, to increase file system usage
    Decreases the number of processes which access a shared file, to decrease file system contention
 File-per-process write performance: 128 MB per file and a 32 MB transfer size

   [Figure: write bandwidth in MB/s (0-12000) versus number of processes/files (0-9000), comparing a 1 MB stripe with a 32 MB stripe.]
 Single shared file write performance: 32 MB per process, with a 32 MB transfer size and stripe size

   [Figure: write bandwidth in MB/s (0-8000) versus number of processes (0-9000), comparing POSIX, MPI-IO, and HDF5 interfaces.]
 Lustre
    Minimize contention for file system resources
    A process should not access more than one or two OSTs
 Performance
    Performance is limited for single-process I/O
    Parallel I/O utilizing a file-per-process or a single shared file is limited at large scales
    A potential solution is to utilize multiple shared files, or a subset of processes which perform I/O
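In practice this usually comes down to choosing Lustre stripe settings per directory before the job writes there. A sketch using the lfs command; the counts are illustrative, and option spellings vary between Lustre versions, so check lfs help setstripe on your system.

    % lfs setstripe -c 1  per_process_output/    # small per-process files: one OST each
    % lfs setstripe -c -1 shared_file_output/    # one large shared file: stripe across all OSTs
    % lfs getstripe shared_file_output/          # verify the layout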
 The standard output and error streams are effectively serial I/O
    All STDIN, STDOUT, and STDERR I/O serializes through aprun
 Disable debugging messages when running in production mode
    "Hello, I'm task 32000!"
    "Task 64000, made it through Lustre loop."
 Advantages of buffered I/O
    Aggregates smaller read/write operations into larger operations
    Examples: OS kernel buffer, MPI-IO collective buffering
 Disadvantages
    Requires additional memory for the buffer
    Can tend to serialize I/O
 Caution
    Frequent buffer flushes can adversely affect performance
 If an application does extremely small, irregular I/O, explicit buffering may improve performance
 Case study (an extreme example): a post-processing application writes a 1 GB file
    The writing is done by one writer, but in many small write operations
    It takes 1080 s (~18 minutes) to complete
    IOBUF was used to intercept these writes with 64 MB buffers
    It then takes 4.5 s to complete: a 99.6% reduction in time

    Lustre File "ssef_cn_2008052600f000"
               Calls    Seconds      Megabytes  Megabytes/sec   Avg Size
    Open           1   0.001119
    Read         217   0.247026       0.105957       0.428931        512
    Write    2083634   1.453222    1017.398927     700.098632        512
    Close          1   0.220755
    Total    2083853   1.922122    1017.504884     529.365466        512
    Sys Read       6   0.655251     384.000000     586.035160   67108864
    Sys Write     17   3.848807    1081.145508     280.904052   66686072
    Buffers used 4 (256 MB)   Prefetches 6   Preflushes 15
 Writing a big-endian binary file with the compiler flag -byteswapio:

    File "XXXXXX"
               Calls      Megabytes     Avg Size
    Open           1
    Write    5918150    23071.28062         4088
    Close          1
    Total    5918152    23071.28062         4088

 Writing a little-endian binary file:

    File "XXXXXX"
               Calls      Megabytes     Avg Size
    Open           1
    Write        350    23071.28062     69120000
    Close          1
    Total        352    23071.28062     69120000
 MPI-IO allows multiple MPI processes to access the same file in a distributed manner
    As with other MPI operations, it is necessary to provide a datatype for the items being written to the file (which may be a derived type)
 There are three ways to declare the file position:
    Explicit offset: each operation explicitly declares the necessary file offset
    Individual file pointers: each process has its own unique handle into the file
    Shared file pointers: the MPI library maintains one file pointer and determines how to handle parallel access (often via serialization)
 For each file-position type there are two coordination patterns:
    Non-collective: each process acts on its own behalf
    Collective: the processes coordinate, possibly allowing the library to make smart decisions about how to access the filesystem
 MPI-IO allows the user to provide "hints" to improve I/O performance; often performance can be improved via hints about the filesystem or problem-specific details
    int mode, ierr;
    char tmps[24];
    MPI_File fh;
    MPI_Info info;
    MPI_Status status;

    /* Open a file across all ranks as read/write.
       Hints can be set between MPI_Info_create and MPI_File_open. */
    mode = MPI_MODE_CREATE | MPI_MODE_RDWR;
    MPI_Info_create(&info);
    MPI_File_open(comm, "output/test.dat", mode, info, &fh);

    /* Set the "view" (offset) for each rank. */
    MPI_File_set_view(fh, commrank*iosize, MPI_DOUBLE, MPI_DOUBLE,
                      "native", info);

    /* Collectively write from all ranks. */
    MPI_File_write_all(fh, dbuf, iosize/sizeof(double), MPI_DOUBLE, &status);

    /* Close the file from all ranks. */
    MPI_File_close(&fh);
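For example, hints could be supplied between MPI_Info_create and MPI_File_open as below. The hint names are standard ROMIO/Cray MPI-IO hints (also discussed on the following slides); the values are illustrative assumptions, not recommendations.

    /* Illustrative hints, set before MPI_File_open(); values are assumptions. */
    MPI_Info_set(info, "striping_factor", "16");      /* Lustre stripe count  */
    MPI_Info_set(info, "striping_unit",   "1048576"); /* 1 MB stripe size     */
    MPI_Info_set(info, "romio_cb_write",  "enable");  /* collective buffering */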
 Several parallel libraries are available that provide a portable, metadata-rich file format
    On Cray machines it is possible to set MPI-IO hints in your environment to improve out-of-the-box performance
 HDF5 (http://www.hdfgroup.org/HDF5/)
    Has long supported parallel file access
    Currently at version 1.8
 NetCDF (http://www.unidata.ucar.edu/software/netcdf/)
    Multiple parallel implementations of NetCDF exist
    Beginning with version 4.0, HDF5 is used under the hood to provide native support for parallel file access
    Currently at version 4.0
 ADIOS (http://adiosapi.org)
    Fairly young library in development by ORNL, Georgia Tech, and others
    Has a native file format, but also supports POSIX, NetCDF, HDF5, and other file formats
    Version 1.0 was released at SC09
 Parallel filesystems
    Minimize contention for file system resources
    A process should not access more than one or two OSTs
    Ideally, I/O buffer sizes and filesystem "chunk" sizes should match evenly to avoid locking
 Performance
    Performance is limited for single-process I/O
    Parallel I/O utilizing a file-per-process or a single shared file is limited at large scales
    A potential solution is to utilize multiple shared files, or a subset of processes which perform I/O
    Large buffers will generally perform best
 Load the IOBUF module:
      % module load iobuf
 Relink the program
 Set the IOBUF_PARAMS environment variable as needed:
      % setenv IOBUF_PARAMS '*:verbose'
 Execute the program
 IOBUF has a large number of options for tuning behavior from file to file; see man iobuf for details
 It may significantly help codes that write a lot to stdout or stderr
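As a sketch of per-file tuning, something like the following can restrict buffering to a file pattern while keeping verbose reporting everywhere else. The size and count keywords shown here are assumptions; consult man iobuf for the exact keyword names supported on your system.

    % setenv IOBUF_PARAMS '*.dat:size=16M:count=4,*:verbose'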
 A particular code both reads and writes a 377 GB file and runs on 6000 cores
    Total I/O volume (reads and writes) is 850 GB
    Utilizes parallel HDF5
 Default stripe settings: count 4, size 1M, index -1
    1800 s run time (~30 minutes)
 Stripe settings: count -1, size 1M, index -1
    625 s run time (~10 minutes)
 Result: a 66% decrease in run time
 Included in the Cray MPT library
 Environment variables used to help MPI-IO optimize I/O performance:
    MPICH_MPIIO_CB_ALIGN (default 2)
    MPICH_MPIIO_HINTS
       Can set striping_factor and striping_unit for files created with MPI-IO
       If writes and/or reads utilize collective calls, collective buffering can be utilized (romio_cb_read/write) to approximately stripe-align I/O within Lustre
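A hedged example of setting these from the environment rather than in code; the pattern:key=value syntax follows the MPI man page, and the file pattern and values are illustrative assumptions only.

    % export MPICH_MPIIO_CB_ALIGN=2
    % export MPICH_MPIIO_HINTS='*.h5:striping_factor=16:striping_unit=1048576:romio_cb_write=enable'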
 MPI-IO API, non-power-of-2 blocks and transfers; in this case blocks and transfers are both 1 MB, with a strided access pattern
 Tested on an XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220 segments, 96 GB file

   [Figure: bar chart of achieved bandwidth, 0-1800 MB/s.]
 MPI-IO API, non-power-of-2 blocks and transfers; in this case blocks and transfers are both 10 KB, with a strided access pattern
 Tested on an XT5 with 32 PEs, 8 cores/node, 16 stripes, 16 aggregators, 3220 segments, 96 GB file

   [Figure: bar chart of achieved bandwidth, 0-160 MB/s.]
 On 5107 PEs, and by application design, a subset of the PEs (88) do the writes. With collective buffering, this is further reduced to 22 aggregators (cb_nodes) writing to 22 stripes
 Tested on an XT5 with 5107 PEs, 8 cores/node

   [Figure: bar chart of achieved bandwidth, 0-4000 MB/s.]
 Total file size is 6.4 GiB: a mesh of 64M bytes, 32M elements, with the work divided amongst all PEs
 The original problem scaled very poorly; for example, without collective buffering, 8000 PEs take over 5 minutes to dump
 Note that disabling data sieving was necessary
 Tested on an XT5 with 8 stripes, 8 cb_nodes

   [Figure: dump time in seconds (log scale, 1-1000) versus PE count, for runs without collective buffering and with CB=0, CB=1, CB=2.]
 Do not open a lot of files all at once (metadata bottleneck)
 Use a simple ls (without color) instead of ls -l (OST bottleneck)
 Remember to stripe files
    Small, individual files => small stripe counts
    Large, shared files     => large stripe counts
 Never set an explicit starting OST for your files (filesystem balance)
 Open files as read-only when possible
 Limit the number of files per directory
 Stat files from just one process
 Stripe-align your I/O (reduces locks)
 Read small, shared files once and broadcast the data (reduces OST contention; sketched below)
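The last point, sketched in C: rank 0 reads a small shared file once and broadcasts it, so thousands of ranks do not hit the same OST with the same read. The file name and size limit are placeholders for this example.

    #include <mpi.h>
    #include <stdio.h>

    #define CFG_MAX 4096                       /* assumed upper bound on file size */

    /* Rank 0 reads the small shared file once; everyone else receives it via
       MPI_Bcast.  'buf' must hold at least CFG_MAX bytes on every rank.        */
    int read_config(char *buf, MPI_Comm comm)
    {
        int rank, len = 0;
        MPI_Comm_rank(comm, &rank);

        if (rank == 0) {
            FILE *fp = fopen("input.cfg", "r");    /* placeholder file name */
            if (fp) {
                len = (int)fread(buf, 1, CFG_MAX, fp);
                fclose(fp);
            }
        }
        MPI_Bcast(&len, 1, MPI_INT, 0, comm);      /* share the byte count  */
        MPI_Bcast(buf, len, MPI_CHAR, 0, comm);    /* share the contents    */
        return len;
    }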
 Adaptable IO System (ADIOS)
    http://www.olcf.ornl.gov/center-projects/adios/
 "Optimizing MPI-IO for Applications on Cray XT Systems" (CrayDoc S-0013-10)
 "A Pragmatic Approach to Improving the Large-scale Parallel I/O Performance of Scientific Applications." Crosby, et al. (CUG 2011)