FPGAs can compete with GPUs for some applications but with some key differences:
1) FPGAs are configured to create custom hardware for an algorithm rather than using predefined hardware like GPUs. This allows high efficiency but is more difficult to program.
2) While OpenCL provides a common language, FPGAs and GPUs have very different architectures and optimizing algorithms requires different approaches for each.
3) For applications with high bandwidth I/O or flexibility requirements, FPGAs may have advantages over GPUs, but GPUs typically have higher performance for compute-heavy applications and better energy efficiency. Overall, FPGAs have become more accessible but still require more programming effort than GPUs.
Parallel Application Performance Prediction Using Analysis Based Modeling - Jason Liu
Parallel Application Performance Prediction Using Analysis Based Models and HPC Simulations, Mohammad Abu Obaida, Jason Liu, Gopinath Chennupati, Nandakishore Santhi, and Stephan Eidenbenz. 2018 SIGSIM Principles of Advanced Discrete Simulation (SIGSIM-PADS’18), May 2018.
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be time consuming, and in an attempt to minimize this time, our project is a parallel implementation of K-Means clustering algorithm on CUDA using C. We present the performance analysis and implementation of our approach to parallelizing K-Means clustering.
Data Analytics and Simulation in Parallel with MATLAB* - Intel® Software
This talk covers the current parallel capabilities in MATLAB*. Learn about its parallel language and distributed and tall arrays. Interact with GPUs both on the desktop and in the cluster. Combine this information into an interesting algorithmic framework for data analysis and simulation.
Declare Your Language: Virtual Machines & Code Generation - Eelco Visser
The document summarizes virtual machines and code generation. It discusses how high-level programming languages are abstracted from low-level machine details through virtual machines. The Java Virtual Machine architecture and bytecode instructions are described, including its stack-based design, threads, heap, and method area. Code generation mechanics like string operations are also covered.
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 - Unity Technologies
This session addresses how we are expanding the scope of the Burst Compiler to enable even the most demanding, hand-coded engine and gameplay problems to be expressed in HPC# via direct CPU intrinsics. Andreas shares the reasoning and use cases; as well as discussing implementation challenges, debugging, and performance along with comparisons to C++ code.
Speaker: Andreas Fredriksson - Unity
Watch the session on YouTube: https://youtu.be/BpwvXkoFcp8
spaGO: A self-contained ML & NLP library in Go - Matteo Grella
Introduction to spaGO, a beautiful and maintainable machine learning library written in Go designed to support relevant neural network architectures in natural language processing tasks.
Github: https://github.com/nlpodyssey/spago
There are many reasons to convert managed languages to native code: performance first of all, but also protection from reverse engineering and support for hardware technologies or certain specialized platforms. In this talk we look at an example of building a C#-to-C++ converter and the nuances that come up when solving this problem.
Introduction to Chainer, a framework for neural networks, v1.11. Slides used for the student seminar on July 20, 2016, at the Sugiyama-Sato lab at the University of Tokyo.
With the arrival of the new C++ standards, a noticeable revival of developer interest in the language is under way. That interest is also fueled by the opportunities that new architectures and technologies bring. The need for efficient parallel algorithms and applications entails a need for libraries and frameworks that can properly adapt and scale a workload across the whole computing system. In this talk Anton presents HPX, a library that addresses these problems: it is based on the new ParalleX execution model, offers an interface compatible with the C++11/14/17 thread library, and greatly extends it in several directions. The treats of the parallel world from C++17, the Concurrency and Parallelism Technical Specifications, will not be left out either!
This document summarizes a reconfigurable system with Linux load on an FPGA. It discusses the software architecture including socket communication, device communication, and architectural layers/details. It then describes the reconfiguration of the FPGA with Linux, including module loading/unloading. Performance results are provided for startup, driver setup, and module/data loading. Finally, future work is discussed around improving algorithms, adding Ethernet support, and a distributed scenario.
1) The document presents various algorithms for efficiently transposing matrices while minimizing memory accesses and cache misses.
2) It analyzes the algorithms under different memory models: RAM, I/O, cache, and cache-oblivious. The block transpose, half/full copying, and Morton layout algorithms improve performance by reusing data blocks.
3) Experimental results on a 300MHz system show the Morton layout and half copying algorithms have the fastest runtimes due to minimizing data references, L1 misses, and TLB misses. The relative performance of algorithms depends on cache miss latency.
The document discusses three sanitizers - AddressSanitizer, ThreadSanitizer, and MemorySanitizer - that detect bugs in C/C++ programs. AddressSanitizer detects memory errors like buffer overflows and use-after-frees. ThreadSanitizer finds data races between threads. MemorySanitizer identifies uses of uninitialized memory. The sanitizers work by instrumenting code at compile-time and providing a run-time library for error detection and reporting. They have found thousands of bugs in major software projects with reasonable overhead. Future work includes supporting more platforms and detecting additional classes of bugs.
1) Template metaprogramming allows performing computations at compile time using templates.
2) In 1994, Erwin Unruh discovered template metaprogramming accidentally when his program calculated the first 30 prime numbers as part of a compiler error message.
3) Template metaprogramming is Turing complete and can be used to implement recursive functions and algorithms that execute at compile time rather than run time.
HPX: a C++11 runtime system for parallel and distributed computing - Platonov Sergey
The document discusses HPX, a C++ runtime system for parallel and distributed computing. It provides asynchronous and remote operations through futures. Futures allow transparent synchronization between producers and consumers of asynchronous operations. The document provides examples of using futures to parallelize recursive filters by futurizing the algorithms. This allows overlapping computation and hiding latencies. Futures can also be used to execute actions on remote localities in a distributed system.
Hadoop classes in Mumbai
Best Android classes in Mumbai with job assistance.
Our features are:
expert guidance by IT industry professionals
lowest fees of 5000
practical exposure to handle projects
well-equipped lab
resume-writing guidance after the course
The slides from my parallel programming talk at LCA 2011. It is an overview of several languages that offer parallel programming paradigms, with a strong bias towards functional programming.
This is a survey of the HPCS languages (Chapel, X10, and Fortress), comparing the idioms with which each supports parallel programming. The paper is available at http://grids.ucs.indiana.edu/ptliupages/publications/Survey_on_HPCS_Languages_formatted_v2.pdf
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
DUSK - Develop at Userland, Install into Kernel - Alexey Smirnov
DUSK is a framework that allows kernel modules to be developed at the user level by compiling them into a user-level program while still maintaining the performance of running in the kernel. It uses helper functions to connect the user-level component to the kernel-level component, allowing things like debugging and testing to be done at the user level. DUSK supports Netfilter modules initially and aims to provide an easier development process for kernel modules.
This document discusses coding style guidelines for logic synthesis. It begins with basic concepts of logic synthesis such as converting a high-level design to a gate-level representation using a standard cell library. It then discusses synthesizable Verilog constructs and coding techniques to improve synthesis like using non-blocking assignments in sequential logic blocks. The document also provides guidelines for coding constructs like if-else statements, case statements, always blocks and loops to make the design easily synthesizable. Memory synthesis approaches and techniques for designing clocks and resets are also covered.
The document discusses using GCC's auto-vectorizer to optimize loops. It provides flags and options for enabling vectorization, checking which loops were vectorized, and tips for writing vectorizable code. Examples are given of vectorized NEON code for improved performance. The Linaro Toolchain group works on vectorization and related optimizations, and examples from users can help with vectorization efforts.
This document discusses dynamic memory allocation in C using four library functions: malloc(), calloc(), realloc(), and free(). It explains that malloc() allocates memory and returns a pointer, calloc() allocates memory and initializes it to zero, realloc() changes the size of previously allocated memory, and free() frees memory allocated by the other functions. Code examples are provided to illustrate usage of each function.
This document introduces Mahout Scala and Spark bindings, which aim to provide an R-like environment for machine learning on Spark. The bindings define algebraic expressions for distributed linear algebra using Spark and provide optimizations. They define data types for scalars, vectors, matrices and distributed row matrices. Features include common linear algebra operations, decompositions, construction/collection functions, HDFS persistence, and optimization strategies. The goal is a high-level semantic environment that can run interactively on Spark.
The document discusses techniques for optimizing memory usage through bit packing and value type polymorphism. It describes:
1. Bit packing techniques like storing multiple values in a single integer using bitwise operations to reduce memory usage. This includes examples of packing booleans and enums.
2. Using a "tagged union" approach to represent different value types polymorphically by storing a type tag and common data in a single value.
3. The concept of "value type polymorphism" where subtypes all fit within a size budget by using a tag to differentiate them while presenting a common API. This allows efficiently representing types in a compiler.
This document summarizes a lecture on using GPU compute languages for advanced graphics processing beyond traditional programmable shading. It discusses using GPU compute APIs for tasks like building histograms, deferred rendering, and custom graphics pipelines. It provides definitions for key concepts in GPU execution like tasks, parallelism, and synchronization. Examples are given of using compute shaders for building histograms from pixel data and implementing a tiled particle rasterization pipeline. Optimizations like processing multiple tiles per workgroup are discussed to improve performance.
The document provides an overview of sanitizers, which are dynamic testing tools that detect bugs like buffer overflows and uninitialized memory reads. It focuses on Address Sanitizer (ASan), which detects invalid address usage bugs, and Undefined Behavior Sanitizer (UBSan), which finds unspecified code semantic bugs. ASan works by dividing memory into main and shadow spaces and instruments code to check shadow values for poisoning. UBSan detects issues like integer overflow and out-of-bounds memory access. Both tools are compiler-instrumented to add checks and generate detailed reports of encountered bugs.
Spark 4th Meetup London - Building a Product with Spark - samthemonad
This document discusses common technical problems encountered when building products with Spark and provides solutions. It covers Spark exceptions like out of memory errors and shuffle file problems. It recommends increasing partitions and memory configurations. The document also discusses optimizing Spark code using functional programming principles like strong and weak pipelining, and leveraging monoid structures to reduce shuffling. Overall it provides tips to debug issues, optimize performance, and productize Spark applications.
H2O Design and Infrastructure with Matt Dowle - Sri Ambati
This document provides an overview of H2O, an open source machine learning platform that allows for distributed, in-memory analytics of large datasets. It discusses how H2O works, including how it uses a map-reduce style to parallelize machine learning algorithms across multiple nodes. The document demonstrates starting an 8-node H2O cluster on Amazon EC2 and importing a 23GB dataset in under a minute, significantly faster than with other tools. It also summarizes how H2O's distributed fork-join framework executes tasks across nodes and shares data through its distributed data structures.
pg_proctab: Accessing System Stats in PostgreSQL - Mark Wong
pg_proctab is a collection of PostgreSQL stored functions that provide access to the operating system process table using SQL. We'll show you which functions are available and where they collect the data, and give examples of their use to collect processor and I/O statistics on SQL queries.
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi - Databricks
The document discusses using CNTK (Microsoft Cognitive Toolkit) for natural language processing and deep learning within Spark pipelines. It provides information on mmlspark, which allows embedding CNTK models into Spark. It also discusses using CNTK to analyze data from GitHub commits and relate code changes to natural language comments through sequence-to-sequence models.
The lecture discusses manycore GPU architectures and programming using OpenMP and HOMP. It introduces OpenMP directives for offloading computation to accelerators and covers data mapping between the host and device. It also discusses HOMP for automated distribution of parallel loops and data across multiple accelerators to improve load balancing and performance. The document provides examples of using OpenMP target directives and data mapping for problems like AXPY and Jacobi iteration on a GPU. It evaluates performance of different loop scheduling algorithms in HOMP on a system with CPUs, GPUs and MICs.
Exploiting GPUs for Columnar DataFrames by Kiran Lonikar - Spark Summit
Kiran Lonikar proposes extending Project Tungsten in Spark SQL to enable parallel execution of DataFrame operations on GPUs. The proposal involves refactoring DataFrames to use a columnar layout and generating OpenCL code for batched execution across columns. Initial results show speedups from GPU execution. Future work includes supporting multi-GPU execution and adapting additional systems like Impala that may be better suited than Spark for GPU integration.
Week 1: Electronic System-level (ESL) Design and SystemC Begin - 敬倫 林
This document provides an introduction and overview of electronic system level (ESL) design using SystemC. It begins with background on ESL design basics, system on chip design flows, and SystemC. It then provides 3 examples of SystemC code: a counter, traffic light, and simple bus. The counter example shows a basic module with clocked process. The traffic light demonstrates a finite state machine. The bus example illustrates an interface, master/slave devices, and memory mapped components communicating over a bus. Overall, the document serves as an introductory tutorial for designing and modeling electronic systems using the SystemC language.
Cray XT Porting, Scaling, and Optimization Best Practices - Jeff Larkin
The document discusses optimization best practices for Cray XT systems. It covers choosing compilers and compiler flags, profiling and debugging codes at scale with hardware performance counters and CrayPAT tools, optimizing communication with MPI by using techniques like pre-posting receives and reducing collectives, and optimizing I/O. The document emphasizes testing optimizations on the number of nodes the application will actually run on.
Published on 11 May 2018
Chainer is a deep learning framework which is flexible, intuitive, and powerful.
This slide introduces some unique features of Chainer and its additional packages such as ChainerMN (distributed learning), ChainerCV (computer vision), ChainerRL (reinforcement learning), Chainer Chemistry (biology and chemistry), and ChainerUI (visualization).
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Hybrid parallel programming uses both message passing (e.g. MPI) and shared memory parallelism (e.g. OpenMP). MPI is used to distribute work across multiple computers while OpenMP parallelizes work within each computer across multiple cores. This approach can improve performance over MPI-only for problems where communication between computers is expensive compared to synchronization within a computer. However, for matrix multiplication experiments, a hybrid MPI-OpenMP approach did not show better performance than MPI-only. Larger problem sizes or different algorithms may be needed to realize benefits of the hybrid approach.
This document provides an introduction and overview of MPI (Message Passing Interface). It discusses:
- MPI is a standard for message passing parallel programming that allows processes to communicate in distributed memory systems.
- MPI programs use function calls to perform all operations. Basic definitions are included in mpi.h header file.
- The basic model in MPI includes communicators, groups, and ranks to identify processes. MPI_COMM_WORLD identifies all processes.
- Sample MPI programs are provided to demonstrate point-to-point communication, collective communication, and matrix multiplication using multiple processes.
- Classification of common MPI functions like initialization, communication, and information queries are discussed.
Apache Spark 2.0 includes improvements that provide considerable speedups for CPU-intensive queries through techniques like code generation. Profiling tools like flame graphs can help analyze where CPU cycles are spent by visualizing stack traces. Flame graphs are useful for performance troubleshooting but have limitations. Testing Spark applications locally and through unit tests allows faster iteration compared to running on clusters and saves resources. It is also important to test with local approximations of distributed components like HDFS and Hive.
The document discusses distributed computing and the MapReduce programming model. It provides examples of how Folding@home and PS3s contribute significantly to distributed computing projects. It then explains challenges with inter-machine parallelism like communication overhead and load balancing. The document outlines Google's MapReduce model which handles these issues and makes programming distributed systems easier through its map and reduce functions.
The document provides security tips and best practices for building web applications in Go. It discusses Go's type system, concurrency model, and standard library features. It also summarizes common vulnerabilities like SQL injection and XSS, and recommends using parameterized queries and HTML escaping to prevent them. Finally, it highlights tools like Gorilla and Gin web frameworks, and techniques like rate limiting and secure cookies to build secure Go applications.
Compiler Construction | Lecture 12 | Virtual MachinesEelco Visser
The document discusses the architecture of the Java Virtual Machine (JVM). It describes how the JVM uses threads, a stack, heap, and method area. It explains JVM control flow through bytecode instructions like goto, and how the operand stack is used to perform operations and hold method arguments and return values.
The document discusses how scripting languages like Python, R, and MATLAB can be used to script CUDA and leverage GPUs for parallel processing. It provides examples of libraries like pyCUDA, rGPU, and MATLAB's gpuArray that allow these scripting languages to interface with CUDA and run code on GPUs. The document also compares different parallelization approaches like SMP, MPI, and GPGPU and levels of parallelism from nodes to vectors that can be exploited.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Tatiana Kojar
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Trusted Execution Environment for Decentralized Process MiningLucaBarbaro3
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
HPC Essentials 0
1. HPC Essentials Prequel: From 0 to HPC in One Hour
OR: five ways to do Kriging
Bill Brouwer
Research Computing and Cyberinfrastructure (RCC), PSU
wjb19@psu.edu
3. Step 0
● Get an account on our systems
● Check out the system details, or let us help pick one for you
● They are Linux systems; you'll need some basic command-line knowledge
– You may want to check out the HPC Essentials I seminar, a Unix/C overview
● We use the modules system for software; you'll need to load what you use. E.g., to see a list of everything available:
module av
E.g., to load Octave:
module load octave
To see which modules you have in your environment:
module list
4. Step 0
● There are two main types of systems:
– Interactive: share a single machine with one or more users, including memory and CPUs; used for
● Debugging
● Benchmarking
● Using a program with a graphical user interface
● Running for short periods of time
– You'll need to log in using Exceed onDemand
5. Step 0
● Batch systems
– Get dedicated memory and CPUs for a period of time
● Maximum time is generally 24 hours
● Maximum memory and CPUs depend on the cluster
– You log in to a head node, from which you submit a request, e.g., an interactive session for 1 node, 1 processor per node (ppn), and 4 GB total memory:
qsub -I -l walltime=24:00:00 -l mem=4gb -l nodes=1:ppn=1
● To check the status of your request:
qstat -u <your_psu_id>
6. Step 0
● Other notes on clusters:
– Please never run anything significant on head nodes; use PBS to submit a job instead
– If you request more than 1 CPU, remember your code/workflow needs to be able to either
● Use multiple CPUs on a single node (set the ppn parameter) using some form of shared memory parallelism
● Use multiple CPUs on multiple nodes (set a combination of node & ppn parameters) using some form of distributed memory parallelism
● A combination of the above
● Parallelism applied in an optimal way is high performance computing
7. High Performance Computing
● Using one or more forms of parallelism to improve the performance and scaling of your code
– Vector architecture, e.g., SSE/AVX in Intel CPUs
– Shared memory parallelism, e.g., using multiple cores of a CPU
– Distributed memory parallelism, e.g., using the Message Passing Interface (MPI) to communicate between CPUs or GPUs
– Accelerators, e.g., Graphics Processing Units
8. Typical Compute Node
[Block diagram of a typical compute node: the CPU connects to RAM over the memory bus, to the IOH over the QuickPath Interconnect, and via PCI-express to the GPU and other PCI-e cards; the ICH, reached over the Direct Media Interface, serves SATA/USB non-volatile storage, the BIOS, and ethernet to the network.]
9. CPU Architecture
● Composed of several complex processing cores, control elements, and high-speed memory areas (e.g., registers, L3 cache), as well as vector elements including special registers
[Diagram: a multicore CPU with four cores sharing a cache, a memory controller, and I/O/PCIe interfaces.]
10. Shared + Distributed Memory Parallelism
● Shared memory parallelism is:
– usually implemented with pThreads or directive-based programming (OpenMP)
– uses one or more cores in a CPU
● Distributed memory parallelism is:
– one or more nodes (composed of CPUs + possibly GPUs) communicating with each other using a high-speed network, e.g., Infiniband
– network topology and fabric are critical to ensuring optimal communication
11. Nvidia GPU Streaming Multiprocessor
[Diagram: streaming multiprocessor with 32 CUDA cores (16 x 2), 32768 x 32-bit registers, 64 kB shared memory/L1 cache, two warp schedulers with dispatch units, 16 load/store units, and 4 special function units; each CUDA core contains a dispatch port, operand collector, FPU, integer unit, and result queue.]
● GPUs run many lightweight threads at once; the device is composed of many more (simpler) cores than a CPU
12. Step 1: Prototype your problem
● Pick a numerical scripting language, e.g., Octave, a free alternative to Matlab
– Solid, well established, linear algebra based
● Code up a solution (e.g., we'll consider ordinary kriging)
● Time all scopes/sections of your code to get a feel for bottlenecks
● You can use the keyboard statement to set breakpoints in your code for debugging purposes
13. Step 1: Prototype your problem
● Kriging is a geospatial statistical method, e.g., predicting rainfall for locations where no measurements exist, based on surrounding measurements
● The solution involves:
– constructing the Gamma matrix
– solving a system of equations for every desired prediction location
14. Step 1: Prototype your problem
function [w,G,g,pred] = krige()
% load input data & output prediction grid
load input.csv; load output.csv;
% init
…
% Gamma; m is size of input space, x,y are coordinates for available data z
for i=1:m
  for j=1:m
    G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))^2+(y(i)-y(j))^2)/3.33));
  end
end
% matrix inversion
Ginv = inv(G);
% predictions; n is size of output space, xp,yp are prediction coordinates
% z is available data for x,y coordinates
for i=1:n
  g(1:m) = 10.*(1-exp(-sqrt((xp(i)-x).^2+(yp(i)-y).^2)/3.33));
  w = Ginv * g';
  pred(i) = sum(w(1:m).*z);
end
15. Results 1
● Use tic/toc statements around code blocks for timing; the following times are for:
– Initialization
– Gamma construction
– Matrix inversion
– Solution
octave:1> [a b c d]=krige();
Elapsed time is 0.079224 seconds.
Elapsed time is 40.9722 seconds.
Elapsed time is 0.742576 seconds.
Elapsed time is 10.6134 seconds.
● 80% of the time is spent constructing the matrix → need to vectorize
● Interpreted languages like Octave benefit from removing loops and replacing them with array operations
– Loops are parsed every iteration by the interpreter
– Vectorizing code by using array operations may take advantage of the vector architecture in the CPU
16. Step 2: Vectorize your Prototype
function [w,G,g,pred] = krige()
% load input data & output prediction grid
load input.csv; load output.csv;
% init
…
% Gamma
XI = (ones(m,1)*x)'; YI = (ones(m,1)*y)';
G(1:m,1:m) = 10.*(1-exp(-sqrt((XI-XI').^2+(YI-YI').^2)/3.33));
% matrix inversion
Ginv = inv(G);
% predictions
XP = (ones(m,1)*xp); YP = (ones(m,1)*yp);
XI = (ones(n,1)*x)'; YI = (ones(n,1)*y)';
ZI = (ones(n,1)*z)';
g(1:m,:) = 10.*(1-exp(-sqrt((XP-XI).^2+(YP-YI).^2)/3.33));
w = Ginv * g;
pred = sum(w(1:m,:).*ZI);
17. Results 2
octave:2> [a b c d]=krige();
Elapsed time is 0.0765891 seconds.
Elapsed time is 0.195605 seconds.
Elapsed time is 0.758174 seconds.
Elapsed time is 3.24861 seconds.
● Code is more than 15x faster, for a relatively small investment
● Vectorized code will have a higher memory overhead, due to the creation of temporary arrays; it's harder to read too :)
● When memory or compute time become unacceptable, there's no choice but to move to compiled code
● C/C++ are logical choices in a Linux environment
– Very stable, heavily used; the Linux OS itself is written in C
– Expressive languages containing many innovations, algorithms, and data structures
– C++ is object oriented, and allows for the design of large, sophisticated projects
18. Step 3 : Compiled Code
● Unlike a scripted language, C/C++ must be compiled to run on the CPU, converting a human-readable language into machine code
● Several compilers are available on the clusters, including Intel, PGI, and the GNU compiler collection
● In the compilation and linking steps we must specify headers (with interfaces) and libraries (with functions) needed by our application
● Try to avoid reinventing the wheel; always use available libraries if you can instead of reimplementing algorithms and data structures
● As opposed to scripting, you are now responsible for memory management, e.g., allocating on the heap (dynamically at runtime) or on the stack (statically at compile time)
19. Step 3 : Compiled Code
● In porting Octave/Matlab code to C/C++ you should always consider using at least these libraries:
– Armadillo, C++ wrappers for BLAS/LAPACK, with syntax very similar to Octave/Matlab
– BLAS/LAPACK itself
● BLAS == Basic Linear Algebra Subprograms
● LAPACK == Linear Algebra PACKage
● Both come in many optimized flavors, e.g., Intel MKL
● If you want to know more about Linux basics, including writing/compiling C code, you could check out HPC Essentials I
● If you want to know more about C++, you could check out HPC Essentials V
20. Step 3 : Compiled Code
#include <armadillo>
#include <mkl.h>
#include <iostream>
using namespace std;
using namespace arma;

int main(){
  mat G; vec g;
  // load data, initialize variables, calculate Gamma
  for (int i=0; i<m; i++)
    for (int j=0; j<m; j++){
      G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
                               +(y(i)-y(j))*(y(i)-y(j)))/3.33));
    }
  char uplo = 'U'; int N = m+1; int info;
  int * ipiv = new int[N]; double * work = new double[3*N];
  // factorize using the LU decomp. routine from LAPACK
  dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);
  // solve
  int nrhs = 1; char trans = 'N';
  for (int i=0; i<n; i++){
    g.rows(0,m-1) = ...
    dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
    pred(i,0) = dot(z,g.rows(0,m-1));
    …
  }
}
21. Results 3
● Compiled code is comparable in speed to the vectorized code, although we could make some algorithmic changes to improve further:
– The Gamma matrix is symmetric, so there's no need to calculate values for j > i (i.e., just calculate/store a triangular matrix)
– Calculating the inverse is expensive and inaccurate; it's better to factorize the matrix and use a direct solve, e.g., using forward/backward substitution (we did do this, but using the full matrix/LU decomp.)
– Armadillo uses operator overloading & expression templates to allow a vectorized approach to programming, although we leave loops in for the moment, to allow parallelization later
● If you have bugs in your code, use gdb to debug
● Always profile completely in order to solve all issues and get a complete handle on your code
22. Important Code Profiling Methods
● Solving memory leaks: use valgrind
● Poor memory access patterns/cache usage
– Use valgrind --tool=cachegrind to assess cache hits + misses
● Heap memory usage
– Memory management has a performance impact; assess with valgrind --tool=massif
● And before you consider moving to parallel, develop a call profile for your code, e.g., in terms of total instructions executed for each scope, using valgrind --tool=callgrind
23. Amdahl's Law
● The problems in science we seek to solve are becoming increasingly large, as we go down in scale (e.g., quantum chemistry) or up (e.g., astrophysics)
● As a natural consequence, we seek both performance and scaling in our scientific applications; thus we parallelize as we run out of resources using a single processor
● We are limited by Amdahl's law, an expression of the maximum improvement of parallel code over serial:
speedup = 1 / ((1 - P) + P/N)
where P is the portion of application code we parallelize, and N is the number of processors; i.e., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking
24. Amdahl's Law
● Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors
25. Step 4 : Accelerate
● In general not all algorithms are amenable, and there is the communication bottleneck between CPU and GPU to overcome
● However, linear algebra operations are extremely efficient on the GPU; you can expect 2-10x over a whole CPU socket (i.e., running all cores) for many operations
● The language for programming Nvidia series GPUs is CUDA; much like C, but you need to know the architecture well and/or:
– Use libraries like cuBLAS (what we'll try)
– Use directive-based programming in the form of OpenACC
– Use the OpenCL language (cross platform, but not as heavily supported by Nvidia as CUDA)
26. Step 4 : Accelerate
#include <armadillo>
#include <mkl.h>
#include <iostream>
#include <cuda.h>
#include <cublas.h>
using namespace std;
using namespace arma;

int main(){
  mat G; vec g;
  // load data, initialize variables, calculate Gamma as before
  // factorize using the LU decomp. routine from LAPACK, as before
  // allocate memory on GPU and transfer data
  // solve on gpu; two steps, solve two triangular systems
  cublasDtrsm(...);
  cublasDtrsm(...);
  // free memory on GPU and transfer data back
}
27. Results 4
● Minimal code changes; recompilation using the nvcc compiler, available by loading any CUDA module on lion-GA (where you'll also need to run)
● We still perform the matrix factorization on the CPU side, and move data to the GPU to perform the solve in two steps
● This overall solution is roughly 6x the single-CPU-thread solution presented previously, for larger data sizes
● General rule of thumb → minimize communication between CPU + GPU; use the GPU when you can occupy all SMPs per device; don't bother for small problems, where the cost of communication outweighs the benefits
● There is ongoing work in porting LAPACK routines to the GPU; e.g., check out our LU/QR work, or the significant MAGMA project from UT/ORNL
● If you're interested in trying CUDA and GPUs further, you could check out HPC Essentials IV
28. Step 5: Shared memory
● We've determined through profiling that it's worthwhile parallelizing our loops
● By linking against Intel MKL we also have access to threaded functions
● We'll simply use OpenMP directive-based programming for this example
● We are generally responsible for deciding which variables need to be shared by threads, and which variables should be privately owned by threads
● If we fail to make these distinctions where needed, we end up with race conditions
– Threads operate on data in an uncoordinated fashion, and data elements have unpredictable/erroneous values
● Outside the scope of this talk, but just as pernicious, is deadlock, when threads (and indeed whole programs) hang due to improper coordination
29. Step 5 : Shared Memory
#include <armadillo>
#include <mkl.h>
#include <iostream>
#include <omp.h>
...
int main(){
  ...
  // load data, initialize variables, calculate Gamma
  #pragma omp parallel for
  for (int i=0; i<m; i++)
    for (int j=0; j<m; j++){
      G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
                               +(y(i)-y(j))*(y(i)-y(j)))/3.33));
    }
  // factorize using the LU decomp. routine from LAPACK
  dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);
  // initialize data for solve, for all right hand sides
  #pragma omp parallel for
  for (int i=0; i<n; i++)
    for (int j=0; j<m; j++)
      g(i,j) = ...
  // multithreaded solve for all RHS
  dgetrs(&trans,&N,&n,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
  // assemble predictions
30. Results 5
● In linking, you must specify -fopenmp if using the GNU compiler, or -openmp for Intel
● At runtime, you need to export the environment variable OMP_NUM_THREADS set to the desired number of threads
● Setting this number beyond the total number of cores you have access to will result in severe performance degradation
● Outside the scope of this talk, but you often need to tune CPU affinity for best performance
● For more information, please check out HPC Essentials II
31. Step 6 : Distributed Memory
● A good motivation for moving to distributed memory is, in a simple case, a shortage of memory on a single node
● From a practical perspective, scheduling distributed CPU cores is easier than shared memory cores, i.e., your PBS queuing time is shorter :)
● We will use the Message Passing Interface (MPI), a venerable standard developed over the last 20 years or so, with language bindings for C and Fortran
● On the clusters, we use OpenMPI (not to be confused with OpenMP); once you load the module, by using the wrapper compilers, compilation and linking paths are taken care of for you
● Aside from needing to link with other libraries like Intel MKL, compiling and linking a C++ MPI program can be as simple as:
module load openmpi
mpic++ my_program.cpp
32. Step 6 : Distributed Memory
#include <armadillo>
#include <mkl.h>
#include <iostream>
#include <mpi.h>

int main(int argc, char * argv[]){
  int rank, size;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  // size == total processes in this MPI_COMM_WORLD pool
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  // rank == my identifier in pool
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // load data, initialize variables, calculate Gamma, perform factorization
  // solve just for my portion of predictions
  int lower = (rank * n) / size;
  int upper = ((rank+1) * n) / size;  // exclusive upper bound
  for (int i=lower; i<upper; i++){
    g.rows(0,m-1) = ...
    dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
    pred(i,0) = dot(z,g.rows(0,m-1));
    …
  }
  // gather results back to root process
}
33. Results 6
● When you run an MPI job using PBS, you need to use the mpirun script to set up your environment and spawn processes on the different CPUs allocated to you by the scheduler:
mpirun my_application.x
● Here we simply divided the output space between the different processors, i.e., each processor in the pool calculated a portion of the predictions
● However, a collective call was needed (not shown) after the solve steps: a gather statement to bring all the results to the root process (with rank 0)
● This was the only communication between the different processes throughout the calculation, i.e., this was close to embarrassingly parallel → no communication, great scaling with processors
● Despite the high bandwidths available on modern networks, the cost of latency is generally the limiting factor in using distributed memory parallelism
● For more on MPI you could check out HPC Essentials III
34. Review
● Let's review some of the things we've discussed
● I'll splash up several scenarios and we'll attempt to score them
35. Score Card
Score What this feels like Your HPC vehicle
+5 Civilized society Something German
+4 Evening with friends American Muscle
+3 Favorite show renewed A Honda
+2 Twinkies are back Sonata
+1 A fairy gets its wings Camry
0 meh Corolla
-1 A fairy dies Neon
-2 Twinkies are gone Pinto
-3 Favorite show canceled LeBaron
-4 Evening with Facebook Yugo
-5 Zombie Apocalypse Abrams tank
36. Scenario 1
● You get an account for hammer, maybe install
and use Exceed onDemand, load and use the
Matlab/Octave module after logging in
37. Scenario 1
● Score : 0
● Meh. You'll run a little faster and probably have
more memory, but this isn't HPC and you could
almost do this on your laptop. You're driving a
Corolla, doing 45 mph in the fast lane.
38. Scenario 2
● You vectorize your loops and/or create a
compiled MEX (Matlab) or OCT (Octave)
function
39. Scenario 2
● Score : +1
● A fairy gets its wings! You move up to the Camry!
● By vectorizing loops you use internal functions that are
interpreted once at runtime, and under the hood may even
get to utilize the vector architecture of the CPU.
● Tricky loops, e.g. those with conditionals, are best converted
to MEX/OCT functions; for Octave you want the
mkoctfile utility
● If compiling new functions, don't forget to link with HPC
libraries, e.g. Intel MKL or AMD ACML, where possible.
40. Scenario 3
● Instead of submitting a PBS job you do all this
on the head node of a batch cluster
41. Scenario 3
● Score : -1
● A fairy dies! You drive a Neon at 35 mph in the HPC fast lane!
● Things could be worse for you, but using memory and CPU on head
nodes can grind processes like parallel filesystems to a halt, making
other users and sys admins feel downright melancholy. Screens
freeze, commands return at the speed of pitchblende.
● If you need dedicated resources and/or to run for more than a few
minutes, please use an interactive cluster or PBS :
https://rcc.its.psu.edu/user_guides/system_utilities/pbs/
42. Scenario 4
● You use Armadillo to port your Matlab/Octave
code to C++, and use version control to
manage your project (e.g. SVN, Git/GitHub)
43. Scenario 4
● Score : +2
● Twinkies are back! You think Hyundai finally has it
together and splash out on the Sonata!
● Vectorized Octave/Matlab code is hard to beat.
However, you may wish to scale outside the node
someday, integrate into an existing C++ project, or
perhaps use rich C++ objects (found in Boost, for
example), so this is the way to go. Actually, there are
myriad reasons.
● Don't forget to compile first with '-Wall -g' options;
then, when it's working and you get the right answer,
optimize!
44. Scenario 5
● You port your Matlab/Octave code to C++
without use of libraries or version control
45. Scenario 5
● Score : -2
● No Twinkies! You drive a Pinto that bursts into flames
immediately!
● Reinventing the wheel is a very bad, time-consuming
idea. Armadillo uses expression templates to create
very efficient code at compile time; without it you
could end up with an inefficient mess.
● Neglect to use version control and you will surely
regret it, probably right around a publication
deadline too. And while we're on the topic, please
back up your data.
46. Scenario 6
● You target sections of your version-controlled
C++ code for acceleration, after understanding
it better by profiling with valgrind --tool=callgrind
47. Scenario 6
● Score : +3
● Futurama is back! You get a new Civic!
● Believe the hype: GPUs are here to stay and will
accelerate many algorithms, especially linear algebra.
● Take advantage of libraries like CUBLAS before rolling
your own code, and check in at the CUDA Zone to see what
applications and code examples already exist. Get familiar
with CUDA; we are an Nvidia CUDA Research Center :
https://research.nvidia.com/content/penn-state-crc-summary
48. Scenario 7
● Your non-version-controlled C++ code has bad
memory access patterns, leaks memory, and creates
many temporaries.
● Score : -3
● Bye-bye Futurama! Hello LeBaron!
● Ignore good memory and cache access patterns at
your peril
● Use valgrind (default) and valgrind --tool=cachegrind
to learn more. Avoid temporaries by using libraries
like Armadillo, or by learning and using expression
templates.
49. Scenario 8
● Scenario 6 and you introduce shared memory parallelism using
OpenMP. You look into and tune CPU affinity.
● Score : +4
● You provide Babette's feast for your friends and elicit a
penchant for the Ford Mustang.
● OpenMP is relatively easy, e.g. a pragma around a for loop.
● Don't forget to check thread performance with valgrind
--tool=helgrind
● Now your code is a thing of beauty: properly version controlled,
profiled completely (well, you could run massif as well), and
you're able to use all the compute hardware in a single
heterogeneous node.
50. Scenario 9
● Scenario 7 AND you decide to thrash disk. Plus you try to
write >= 1M files
● Score : -4
● The Yugo is only cool in that Portlandia bit, and Facebook was
only good for a brief period in 2006.
● Disk I/O kills in an HPC context; plus, the maximum file limit at
the time of writing is 1M
– You give control to the kernel and your application ceases to
execute for some time (a voluntary context switch)
– You might be contending for disk with other processes
– You introduce the lower memory bandwidth (BW) and higher
latency (Delta) of disk versus system memory
– Parallel filesystems → all of the above, plus network BW and Delta
51. Scenario 10
● Scenario 8 AND you decide to scale outside the
node with MPI. You look into design patterns; the
GoF book is on the nightstand.
● Score : +5
● You are a cultured individual and you drive a
German vehicle. You care about engineering.
● Don't forget Amdahl's law
● Even with InfiniBand networks, minimize communication,
and consider new paradigms in distributed memory
parallelism (check out MPI revision 3).
52. Scenario 11
● Scenario 9 and you do it all on the head node,
including OpenMP for 1% of your loops. You also
export OMP_NUM_THREADS=20 when you have
only 10 cores. There's no coordination between
threads: races all over the place. You have
about 40 MPI processes trying to read the same
file as well, without parallel file I/O.
53. Scenario 11
● Score : -5
● The end is nigh and you're taking out zombies and
HPC infrastructure in your Abrams tank, moving at
1 mph, getting 0.2 miles to the gallon
● You ignored all the other advice, and now you throw
out Amdahl's law too.
● AND you have no coordination between any of your
threads or processes.
● AND you're trying to run more threads and processes
than the system can support concurrently, so context
switching takes place furiously.
● Expect a not-so-rosy email from sys admin :-)
54. Summary
● High performance computing is leveraging one or more
forms of parallelism in a performant way
● Often the best gains come from writing vectorized Octave
code, or from making algorithmic changes
● Before you parallelize, fully profile your code and keep
Amdahl's law in mind
● All forms of parallelism have their limitations, but in
general:
– GPU accelerators are excellent for linear algebra
– Shared memory using OpenMP works well for simple, nested
loops
– Consider using MPI (distributed memory) for 'big data', but
limit communication