HPC Essentials



Seminar series delivered at PSU on high performance computing



  1. 1. HPC Essentials Part I : UNIX/C Overview. Bill Brouwer, Research Computing and Cyberinfrastructure (RCC), PSU. wjb19@psu.edu
  2. 2. Outline
      ● Introduction
          ● Hardware
          ● Definitions
          ● UNIX
          ● Kernel & shell
      ● Files
          ● Permissions
          ● Utilities
          ● Bash Scripting
      ● C programming
  3. 3. HPC Introduction
      ● HPC systems are composed of :
          ● Software
          ● Hardware
              ● Devices (eg., disks)
              ● Compute elements (eg., CPU)
              ● Shared and/or distributed memory
              ● Communication (eg., Infiniband network)
      ● An HPC system is not high performance unless the hardware is configured correctly and the software leverages all resources made available to it, in an optimal manner
      ● An operating system controls the execution of software on the hardware; HPC clusters almost exclusively use UNIX/Linux
      ● In the computational sciences, we pass data and/or abstractions through a pipelined workflow; UNIX is the natural analogue to this solving/discovery process
  4. 4. UNIX
      ● UNIX is a multi-user, multi-tasking OS created by Dennis Ritchie and Ken Thompson at AT&T Bell Labs in 1969-1970, and later rewritten primarily in the C language (also developed by Ritchie)
      ● UNIX is composed of :
          ● Kernel
              ● The OS itself, which handles scheduling, memory management, I/O etc
          ● Shell (eg., Bash)
              ● Command line interpreter; interacts with the kernel
          ● Utilities
              ● Programs run by the shell; tools for file manipulation and interaction with the system
          ● Files
              ● Everything but process(es), composed of data...
  5. 5. Data-Related Definitions
      ● Binary
          ● Most fundamental data representation in computing, base 2 number system (others; hex → base 16, oct → base 8)
      ● Byte
          ● 8 bits = 8b = 1 Byte = 1B; 1kB = 1024 B; 1MB = 1024 kB etc
      ● ASCII
          ● American Standard Code for Information Interchange; character encoding scheme using 7 bits per character (usually stored in one 8-bit byte); UTF-8, a variable-length Unicode encoding, is a superset of ASCII
      ● Stream
          ● A flow of bytes; source → stdout (& stderr), sink → stdin
      ● Bus
          ● Communication channel over which data flows, connects elements within a machine
      ● Process
          ● Fundamental unit of computational work performed by a processor; CPU executes application or OS instructions
      ● Node
          ● Single computer, composed of many elements; various architectures for CPU, eg., x86, RISC
  6. 6. Typical Compute Node (Intel i7)
      [Figure: block diagram of a typical compute node. Labelled elements: CPU, RAM (volatile storage), memory bus, QuickPath Interconnect, IOH, GPU, PCI-express, PCI-e cards, Direct Media Interface, ICH, ethernet/network, SATA/USB, BIOS, non-volatile storage.]
  7. 7. More Definitions
      ● Cluster
          ● Many nodes connected together via network
      ● Network
          ● Communication channel, inter-node; connects machines
      ● Shared Memory
          ● Memory region shared within node
      ● Distributed Memory
          ● Memory region across two or more nodes
      ● Direct Memory Access (DMA)
          ● Access memory independently of programmed I/O ie., independent of the CPU
      ● Bandwidth
          ● Rate of data transfer across serial or parallel communication channel, expressed as bits (b) or Bytes (B) per second (s)
          ● Beware quotations of bandwidth; many factors eg., simplex/duplex, peak/sustained, no. of lanes etc
          ● Latency, or the time to create a communication channel, is often more important
  8. 8. Bandwidths
      ● Devices
          ● USB : 60 MB/s (version 2.0)
          ● Hard Disk : 100 MB/s - 500 MB/s
          ● PCIe : ~4 GB/s per direction (x8, version 2.0, at 500 MB/s per lane)
      ● Networks
          ● 10/100Base T : 10/100 Mbit/s
          ● 1000BaseT (1GigE) : 1000 Mbit/s
          ● 10 GigE : 10 Gbit/s
          ● Infiniband QDR 4X : 40 Gbit/s
      ● Memory
          ● CPU : ~ 35 GB/s (Nehalem, 3x 1.3GHz DIMM/socket)*
          ● GPU : ~ 180 GB/s (GeForce GTX 480)
      ● AVOID devices, keep data resident in memory, minimize communication between processes
      ● MANY subtleties to CPU memory management eg., with 8x CPU cores, total bandwidth may be > 300 GB/s or as little as 10 GB/s; will discuss further
      * http://www.delltechcenter.com/page/04-08-2009+-+Nehalem+and+Memory+Configurations?t=anon#fbid=XZRzflqVZ6J
  10. 10. UNIX Permissions & Files
      ● At the highest level, UNIX objects are either files or processes, and both are protected by permissions (processes next time)
      ● Every file object has two IDs, the user and group; both are assigned on creation. Only the root user has unrestricted access to everything
      ● Files also have bits which specify read (r), write (w) and execute (x) permissions for the user, group and others eg., output of ls command:
          -rw-r--r-- 1 root root 0 Jun 11 1976 /usr/local/foo.txt
          (fields: user/group/others permission bits, ..., User ID, Group ID, ..., filename)
      ● We can manipulate files using myriad utilities; these utilities are commands interpreted by the shell and executed by the kernel
      ● To learn more, check man pages ie., from the command line: man <command>
  11. 11. File Manipulation I
      Working from the command line in a Bash shell:
      ● List directory foo_dir contents, human readable :
          [wjb19@lionga scratch] $ ls -lah foo_dir
      ● Change ownership of foo.xyz to wjb19; group and user :
          [wjb19@lionga scratch] $ chown wjb19:wjb19 foo.xyz
      ● Add execute permission to foo.xyz :
          [wjb19@lionga scratch] $ chmod +x foo.xyz
      ● Determine filetype for foo.xyz :
          [wjb19@lionga scratch] $ file foo.xyz
      ● Peruse text file foo.xyz :
          [wjb19@lionga scratch] $ more foo.xyz
  12. 12. File Manipulation II
      ● Copy foo.txt from lionga to file /home/bill/foo.txt on dirac :
          [wjb19@lionga scratch] $ scp foo.txt wjb19@dirac.rcc.psu.edu:/home/bill/foo.txt
      ● Create gzip compressed file archive of directory foo and contents (the f option must come last, since it takes the archive name as its argument) :
          [wjb19@lionga scratch] $ tar -czf foo_archive.tgz foo/
      ● Create bzip2 compressed file archive of directory foo and contents :
          [wjb19@lionga scratch] $ tar -cjf foo_archive.tbz foo/
      ● Unpack compressed file archive :
          [wjb19@lionga scratch] $ tar -xvf foo_archive.tgz
      ● Edit a text file using VIM :
          [wjb19@lionga scratch] $ vim foo.txt
      ● VIM is a venerable and powerful command line editor with a rich set of commands
  13. 13. Text File Edit w/ VIM
      ● Two main modes of operation; editing or command. From command, switch to edit by issuing a (insert after cursor) or i (before); switch back to command via <ESC>
          Save w/o quitting                                      :w<ENTER>
          Save and quit (or ZZ ie., <shift> AND z AND z)         :wq<ENTER>
          Quit w/o saving                                        :q!<ENTER>
          Delete x lines eg., x=10 (also stored in clipboard)    d10d
          Yank (copy) x lines eg., x=10                          y10y
          Split screen/buffer                                    :split<ENTER>
          Switch window/buffer                                   <CNTRL>-w-w
          Go to line x eg., x=10                                 :10<ENTER>
          Find matching construct (eg., from { to })             %
      ● Paste: p   undo: u   redo: <CNTRL>-r
      ● Move up/down one screen line : - and +
      ● Search for expression exp, forward /exp<ENTER> or backward ?exp<ENTER> (n or N navigate up/down highlighted matches)
  14. 14. Text File Compare w/ VIMDIFF
      ● Same commands as VIM, but highlights differences in files and allows transfer of text between buffers/files; launch with vimdiff foo.txt foo2.txt
      ● Push text from right to left (when right window active and cursor in relevant region) using command dp
      ● Pull text from right to left (when left window active and cursor in relevant region) using command do
  15. 15. Bash Scripting
      ● File and other utilities can be assembled into scripts, interpreted by the shell eg., Bash
      ● The scripts can be collections of commands/utilities & fundamental programming constructs
          Code comment                                #this is a comment
          Pipe stdout of procA to stdin of procB      procA | procB
          Redirect stdout of procA to file foo.txt*   procA > foo.txt
          Command separator                           procA; procB
          If block                                    if [ condition ]; then procA; fi
          Display on stdout                           echo "hello"
          Variable assignment & literal value         a="foo"; echo $a     (no spaces around =)
          Concatenate strings                         b="${a}foo2"
          Text processing utilities                   sed, gawk
          Search utilities                            find, grep
      * Streams have file descriptors (numbers) associated with them; eg., to redirect stderr from procA to foo.txt → procA 2> foo.txt
  16. 16. Text Processing
      ● Text documents are composed of records (roughly speaking, lines separated by newlines) and fields (separated by spaces)
      ● Text processing using sed & gawk involves coupling patterns with actions eg., print field 1 in document foo.txt when encountering word image (pattern /image/, action {print $1;}, input foo.txt) :
          [wjb19@lionga scratch] $ gawk '/image/ {print $1;}' foo.txt
      ● Parse without case sensitivity, change from default space field separator (FS) to equals sign, print field 2 :
          [wjb19@lionga scratch] $ gawk 'BEGIN{IGNORECASE=1; FS="="} /image/ {print $2;}' foo.txt
      ● Putting it all together → create a Bash script w/ VIM or other (eg., Pico)...
  17. 17. Bash Example I
          #!/bin/bash
          #(run using bash)

          #set source and destination paths
          DIR_PATH=~/scratch/espresso-PRACE/PW
          BAK_PATH=~/scratch/PW_BAK

          #declare an array
          declare -a file_list

          #filenames to array, taken from command output
          file_list=( $(ls -l ${BAK_PATH} | gawk '/f90/ {print $9}') )
          cnt=0

          #parse files & pretty up (search & replace)
          for x in "${file_list[@]}"
          do
              let "cnt+=1"
              #\& is a literal ampersand; a bare & would re-insert the match
              sed 's/,&/, \&/g' $BAK_PATH/$x |
              sed 's/)/) /g' |
              sed 's/call/ call /g' |
              sed 's/CALL/ call /g' > $DIR_PATH/$x
              echo cleaned file no. $cnt $x
          done
          exit
  18. 18. Bash Example II
          #!/bin/bash
          #check total number of arguments
          if [ $# -lt 6 ]
          then
              echo "usage: fitCPCPMG.sh [/path/and/filename.csv]"
              echo "  [desired number of gaussians in mixture (2-10)]"
              echo "  [no. random samples (1000-10000)] [mcmc steps (1000-30000)]"
              echo "  [percent noise level (0-10)] [percent step size (0.01-20)]"
              echo "  [/path/to/restart/filename.csv; optional]"
              exit
          fi

          #file extension
          ext=${1##*.}
          if [ "$ext" != "csv" ]
          then
              echo "ERROR: file must be *.csv"
              exit
          fi

          #file basename
          base=$(basename $1 .csv)
          if [[ $2 -lt 2 ]] || [[ $2 -gt 10 ]]
          then
              echo "ERROR: must specify 2<=x<=10 gaussians in mixture"
              exit
          fi
  20. 20. The C Language
      ● Utilities, user applications and indeed the UNIX OS itself are executed by the CPU when expressed as machine code eg., store/load from memory, addition etc
      ● Fundamental operations like memory allocation, I/O etc are laborious to express at this level; most frequently we begin from a high-level language like C
      ● The process of creating an executable consists of at least 3 fundamental steps; creation of a source code text file containing all desired objects and operations, compilation, and linking eg., using the GNU tool gcc to create executable foo.x from source file foo.c (-std=c99 selects the C99 standard) :
          [wjb19@tesla2 scratch]$ gcc -std=c99 foo.c -o foo.x
      [Figure: build flow. Source code (*.c) → compile → object file (*.o) → link, together with library objects → executable.]
  21. 21. C Code Elements I
      ● Composed of primitive datatypes (eg., int, float, long), which have different sizes in memory, in multiples of 1 byte
      ● May be composed of statically allocated memory (compile time), dynamically allocated memory (runtime), or both
      ● Pointers (eg., float *) are primitives with 4 or 8 byte lengths (32bit or 64bit machines) which contain a memory address, often that of a contiguous region of dynamically allocated memory
      ● More complicated objects can be constructed from primitives and arrays eg., a struct
  22. 22. C Code Elements II
      ● Common operations are gathered into functions, the most common being main(), which must be present in an executable
      ● Functions have a distinct name, take arguments, and return output; this information comprises the prototype, expressed separately from the implementation details, with the former often placed in a header file
      ● Important system and library functions include read, write, printf (I/O) and malloc, free (memory)
      ● The operating system executes compiled code; a running program is a process (more next time)
  23. 23. C Code Example
          //tells preprocessor to include these headers; system functions etc
          #include <stdio.h>
          #include <stdlib.h>
          #include "allDefines.h"

          //Kirchhoff Migration function in psktmCPU.c
          //function prototype; must give arguments, their types and return type;
          //implementation elsewhere
          void ktmMigrationCPU(struct imageGrid* imageX,
                  struct imageGrid* imageY,
                  struct imageGrid* imageZ,
                  struct jobParams* config,
                  float* midX,
                  float* midY,
                  float* offX,
                  float* offY,
                  float* traces,
                  float* slowness,
                  float* image);

          int main(){
                  int IMAGE_SIZE = 10;
                  float* image = (float*) malloc (IMAGE_SIZE*sizeof(float));
                  if (image == NULL) return 1;    //allocation can fail
                  printf("size of image = %i\n",IMAGE_SIZE);
                  for (int i=0; i<IMAGE_SIZE; i++){
                          image[i] = 0.0f;        //malloc does not initialize memory
                          printf("image point %i = %f\n",i,image[i]);
                  }
                  free(image);
                  return 0;
          }
  24. 24. UNIX C Good Practice I
      ● Use the three standard streams stdin, stdout and stderr, with file descriptors 0, 1, 2 respectively; this allows assembly of operations into a pipeline, and these data streams are cheap to use
      ● Only hand simple command line options to main() using argc, argv[]; in general we wish to handle short and long options (eg., see GNU coding standards) and the use of getopt_long() is preferable
      ● Utilize the environment variables of the host shell, particularly in setting runtime conditions in executed code via getenv() eg., in Bash set in .bashrc config file or via command line :
          [wjb19@lionga scratch] $ export MY_STRING=hello
      ● If your project/program a) requires sophisticated objects, b) has many developers, or c) would benefit from object oriented design principles, you should consider writing in C++ (although, being a higher-level language, it can be harder to optimize)
  25. 25. UNIX C Good Practice II
      ● In high performance applications, avoid system calls eg., read/write, where control is given over to the kernel and processes can be blocked until the resource is ready eg., disk
          ● IF system calls must be used, handle errors and report to stderr
          ● IF temporary files must be written, use mkstemp, which sets permissions, followed by unlink; the file descriptor is closed by the kernel when the program exits, and the file is removed
      ● Use assert to test validity of function arguments, statements etc; this will introduce a performance hit, but asserts can be removed at compile time with the NDEBUG macro (C standard)
      ● Debug with gdb, profile with gprof, valgrind; target the most expensive functions for optimization
      ● Put common functions in/use libraries wherever possible...
  26. 26. Key HPC Libraries
      ● BLAS/LAPACK/ScaLAPACK
          ● Original basic and extended linear algebra routines
          ● http://www.netlib.org/
      ● Intel Math Kernel Library (MKL)
          ● Implementation of above routines, w/ solvers, fft etc
          ● http://software.intel.com/en-us/articles/intel-mkl/
      ● AMD Core Math Library (ACML)
          ● Ditto
          ● http://developer.amd.com/libraries/acml/pages/default.aspx
      ● OpenMPI
          ● Open source MPI implementation
          ● http://www.open-mpi.org/
      ● PETSc
          ● Data structures and routines for parallel scientific applications based on PDEs
          ● http://www.mcs.anl.gov/petsc/petsc-as/
  27. 27. UNIX C Compilation I
      ● In general the creation and use of shared libraries (*.so) is preferable to static (*.a), for space reasons and ease of software updates
      ● Program in modules and link separate objects
      ● Use the -fPIC flag in shared library compilation; PIC == position independent code, ie., code in the shared object does not depend on the address/location at which it is loaded
      ● Use the make utility to manage builds (more next time)
      ● Don't forget to update your PATH and LD_LIBRARY_PATH env vars w/ your binary executable path & any libraries you need/created for the application, respectively
  28. 28. UNIX C Compilation II
      ● Remember in compilation steps to -I/set/header/paths and keep interface (in headers) separate from implementation as much as possible
      ● Remember in linking steps for shared libs to :
          ● -L/set/path/to/library AND
          ● set flag -lmyLib, where
          ● /set/path/to/library/libmyLib.so must exist
        otherwise you will have undefined references and/or "can't find -lmyLib" etc
      ● Compile with -Wall or similar and fix all warnings
      ● Read the manual :)
  29. 29. Conclusions
      ● High Performance Computing systems are an assembly of hardware and software working together, usually based on the UNIX OS; multiple compute nodes are connected together
      ● The UNIX kernel is surrounded by a shell eg., Bash; commands and constructs may be assembled into scripts
      ● UNIX, associated utilities and user applications are traditionally written in high-level languages like C
      ● HPC user applications may take advantage of shared or distributed memory compute models, or both
      ● Regardless, good code minimizes I/O, keeps data resident in memory for as long as possible and minimizes communication between processes
      ● User applications should take advantage of existing high performance libraries, and tools like gdb, gprof and valgrind
  30. 30. References
      ● Dennis Ritchie, RIP : http://en.wikipedia.org/wiki/Dennis_Ritchie
      ● Advanced bash scripting guide : http://tldp.org/LDP/abs/html/
      ● Text processing w/ GAWK : http://www.gnu.org/s/gawk/manual/gawk.html
      ● Advanced Linux programming : http://www.advancedlinuxprogramming.com/alp-folder/
      ● Excellent optimization tips : http://www.lri.fr/~bastoul/local_copies/lee.html
      ● GNU compiler collection documents : http://gcc.gnu.org/onlinedocs/
      ● Original RISC design paper : http://www.eecs.berkeley.edu/Pubs/TechRpts/1982/CSD-82-106.pdf
      ● C++ FAQ : http://www.parashift.com/c++-faq-lite/
      ● VIM Wiki : http://vim.wikia.com/wiki/Vim_Tips_Wiki
  31. 31. Exercises
      ● Take supplied code and compile using gcc, creating executable foo.x; attempt to run as ./foo.x
      ● Code has a segmentation fault, an error in memory allocation which is handled via the malloc function
      ● Recompile with debug flag -g, run through gdb and correct the source of the segmentation fault
      ● Load the valgrind module ie., module load valgrind, and then run as valgrind ./foo.x; this powerful profiling tool will help identify memory leaks, or memory on the heap* which has not been freed
      ● Write a Bash script that stores your home directory file contents in an array and :
          ● Uses sed to swap vowels (eg., a and e) in names
          ● Parses the array of names and returns only a single match, if it exists, else echo NO-MATCH
      * heap == region of dynamically allocated memory
  32. 32. GDB quick start
      ● Launch :
          [wjb19@tesla1 scratch]$ gdb ./foo.x
      ● Run w/ command line argument 100 :
          (gdb) run 100
      ● Set breakpoint at line 10 in source file :
          (gdb) b foo.c:10
          Breakpoint 1 at 0x400594: file foo.c, line 10.
          (gdb) run
          Starting program: /gpfs/scratch/wjb19/foo.x
          Breakpoint 1, main () at foo.c:22
          22 int IMAGE_SIZE = 10;
      ● Step to next source line (issuing continue will resume execution) :
          (gdb) step
          23 float * image = (float*) malloc (IMAGE_SIZE*sizeof(float));
      ● Print third value (index 2) in array image :
          (gdb) p image[2]
          $4 = 0
      ● Display full backtrace :
          (gdb) bt full
          #0  main () at foo.c:27
                  i = 0
                  IMAGE_SIZE = 10
                  image = 0x601010
  33. 33. HPC Essentials Part II : Elements of Parallelism. Bill Brouwer, Research Computing and Cyberinfrastructure (RCC), PSU
  34. 34. Outline
      ● Introduction
          ● Motivation
          ● HPC operations
          ● Multiprocessors
          ● Processes
      ● Memory Digression
          ● Virtual Memory
          ● Cache
      ● Threads
          ● POSIX
          ● OpenMP
          ● Affinity
  35. 35. Motivation
      ● The problems in science we seek to solve are becoming increasingly large, as we go down in scale (eg., quantum chemistry) or up (eg., astrophysics)
      ● As a natural consequence, we seek both performance and scaling in our scientific applications
      ● Therefore we want to increase floating point operations performed and memory bandwidth, and thus seek parallelization as we run out of resources using a single processor
      ● We are limited by Amdahl's law, an expression of the maximum improvement of parallel code over serial:
          speedup = 1 / ((1 - P) + P/N)
        where P is the portion of application code we parallelize, and N is the number of processors ie., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking. As N grows without bound the speedup approaches 1/(1 - P) eg., at most 10x for P = 90%
  36. 36. Motivation
      ● Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors
      [Figure: improvement factor (0 to 12) versus number of processors (0 to 256), with curves for P = 10%, 30%, 60% and 90%.]
      ● Nonetheless, for many applications we have a good chance of parallelizing the vast majority of the code...
  37. 37. Example : Kirchhoff Time Migration
      ● KTM is a technique used widely in oil+gas exploration, providing images into the Earth's interior, used to identify resources
      ● Seismic trace data acquired over 2D geometry is integrated to give an image of the Earth's interior, using ~ Green's method
      ● Input is generally 10^4 to 10^6 traces, 10^3 to 10^4 data points each, ie., lots of data to process; output image is also very large
      ● This is an integral technique (ie., summation, easy to parallelize), just one of many popular algorithms performed in HPC
      [Equation not recoverable from the slide; its labels read: image point, weight, trace data. Legend: x == image space, (symbol lost) == seismic space, t == traveltime.]
  38. 38. Common Operations in HPC
      ● Integration
          ● Load/store, add & multiply
          ● eg., transforms
      ● Derivatives (Finite differences)
          ● Load/store, subtract & divide
          ● eg., PDE
      ● Linear Algebra
          ● Load/store, subtract/add/multiply/divide
          ● chemistry & physics, solvers
          ● sparse (classical physics) & dense (quantum)
      ● Regardless of the operations performed, after compilation into machine code, when executed by the CPU, instructions are clocked through a pipeline into registers for execution
      ● Instruction execution generally takes place in four steps, and multiple instruction groups are concurrent within the pipeline; execution rate is a direct function of the clock rate
  39. 39. Execution Pipeline
      ● This is the most fine-grained form of parallelism; its efficiency is a strong function of branch prediction hardware, or the prediction of which instruction in a program is the next to execute*
      ● At a similar level, present in more recent devices are so-called streaming SIMD extension (SSE) registers and associated compute hardware
      [Figure: pipeline diagram over clock cycles 0 to 7, showing pending, executing and completed instructions moving through four stages: 1. Fetch, 2. Decode, 3. Execute, 4. Write-back.]
      * assisted by compiler hints
  40. 40. SSE
      ● Streaming SIMD (Single Instruction, Multiple Data) computation exploits special registers and instructions to increase computation many-fold in certain cases, since several data elements are operated on simultaneously
      ● Each of 8 SSE registers (labeled xmm0 through xmm7; 16 are available in 64-bit mode) is 128 bits long, storing 4 x 32-bit floating-point numbers (float3 float2 float1 float0, from bit 127 down to bit 0); SSE2 and SSE3 specifications have expanded the allowed datatypes to include doubles, ints etc
      ● Operations may be scalar or packed (ie., vector), expressed using inline assembly (eg., an __asm block) or compiler intrinsics within C code eg.,
          addps xmm0,xmm1     (operation, dst operand, src operand)
      ● One can either code the SSE operations explicitly, or rely on the compiler eg., icc with optimization (-O3)
      ● The next level up of parallelization is the multiprocessor...
  41. 41. Multiprocessor Overview
      ● Multiprocessors or multiple core CPUs are becoming ubiquitous; better scaling (cf Moore's law) but limited by contention for shared resources, especially memory
      ● Most commonly we deal with Symmetric Multiprocessors (SMP), with unique cache and registers, as well as shared memory region(s); more on cache in a moment
      ● Memory not necessarily next to processors → Non-uniform Memory Access (NUMA); try to ensure memory access is as local to CPU core(s) as possible
      ● The proc directory on UNIX machines is a special directory written and updated by the kernel, containing information on CPU (/proc/cpuinfo) and memory (/proc/meminfo)
      ● The fundamental unit of work on the cores is a process...
      [Figure: CPU0 and CPU1, each with its own registers and cache, sharing main memory.]
  42. 42. Processes
      ● Application processes are launched on the CPU by the kernel using the fork() system call; every process has a process ID pid, available on UNIX systems via the getpid() system call
      ● The kernel manages many processes concurrently; all information required to run a process is contained in the process control block (PCB) data structure, containing (among other things) :
          ● The pid
          ● The address space
          ● I/O information eg., open files/streams
          ● Pointer to next PCB
      ● Processes may spawn children using the fork() system call; children are initially a copy of the parent, but may take on different attributes via the exec() call
  43. 43. Processes
      ● A child process records the ID of its parent (ppid), and additionally has a unique pid eg., output from ps command, describing itself :
          [wjb19@tesla1 ~]$ ps -eHo "%P %p %c %t %C"
           PPID   PID COMMAND             ELAPSED %CPU
          12608  1719     sshd           01:07:54  0.0
           1719  1724       sshd         01:07:49  0.0
           1724  1725         bash       01:07:48  0.0
           1725  1986           ps          00:00  0.0
      ● During a context switch, the kernel will swap one process control block for another; context switches are detrimental to HPC and have one or more triggers, including :
          ● I/O requests
          ● Timer interrupts
      ● Context switching is a very fine-grained form of scheduling; on compute clusters we also have coarse grained scheduling in the form of job scheduling software (more next time)
      ● The unique address space from the perspective of the process is referred to as virtual memory
  44. 44. Virtual Memory
      ● A running process is given memory by the kernel, referred to as virtual memory (VM); its address space does not correspond to the physical memory address space
      ● The Memory Management Unit (MMU) on the CPU translates between the two address spaces, for requests made between process and OS
      ● Virtual memory for every process has the same structure, listed below; virtual address space is divided into units called pages
          High Address
              Environment variables, function arguments
              Stack
              Unused
              Heap
              Instructions
          Low Address
      ● The MMU is assisted in address translation by the Translation Lookaside Buffer (TLB), which stores page details in a cache
      ● Cache is high speed memory immediately adjacent to the CPU and its registers, connected via bus(es)
  45. 45. Cache : Introduction
      ● In HPC, we talk about problems being compute or memory bound
          ● In the former case, we are limited by the rate at which instructions can be executed by the CPU
          ● In the latter, we are limited by the rate at which data can be delivered from memory to the CPU
      ● Both instructions and data are loaded into cache; cache memory is laid out in lines
      ● Cache memory is intermediate in the overall hierarchy, lying between CPU registers and main memory
      ● If the executing process requests an address corresponding to data or instructions in cache, we have a hit, else a miss, and a much slower retrieval of instruction or data from main memory must take place
  46. 46. Cache : Introduction
      ● Modern architectures have various levels of cache and divisions of responsibilities; we will follow valgrind-cachegrind convention, from the manual:
          ... It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines. However, some modern machines have three levels of cache. For these machines (in the cases where Cachegrind can auto-detect the cache configuration) Cachegrind simulates the first-level and third-level caches. The reason for this choice is that the L3 cache has the most influence on runtime, as it masks accesses to main memory. Furthermore, the L1 caches often have low associativity, so simulating them can detect cases where the code interacts badly with this cache (eg. traversing a matrix column-wise with the row length being a power of 2)
  47. 47. Cache Example
      ● The distribution of data to cache levels is largely set by compiler, hardware and kernel; however, the programmer is still responsible for the best data access patterns possible in his/her code
      ● Use cachegrind to optimize data alignment & cache usage eg.,
          #include <stdlib.h>
          #include <stdio.h>

          int main(){
                  int SIZE_X,SIZE_Y;
                  SIZE_X=2048;
                  SIZE_Y=2048;
                  float * data = (float*) malloc(SIZE_X*SIZE_Y*sizeof(float));
                  for (int i=0; i<SIZE_X; i++)
                          for (int j=0; j<SIZE_Y; j++)
                                  data[j+SIZE_Y*i] = 10.0f * 3.14f;
                                  //bad data access
                                  //data[i+SIZE_Y*j] = 10.0f * 3.14f;
                  free(data);
                  return 0;
          }
  48. 48. Cache : Bad Access
          bill@bill-HP-EliteBook-6930p:~$ valgrind --tool=cachegrind ./foo.x
          ==3088== Cachegrind, a cache and branch-prediction profiler
          ==3088== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
          ==3088== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
          ==3088== Command: ./foo.x
          ==3088==
          ==3088==
          ==3088== I   refs:      50,503,275
          ==3088== I1  misses:           734
          ==3088== LLi misses:           733
          ==3088== I1  miss rate:       0.00%
          ==3088== LLi miss rate:       0.00%
          ==3088==
          ==3088== D   refs:      33,617,678  (29,410,213 rd   + 4,207,465 wr)
          ==3088== D1  misses:     4,197,161  (     2,335 rd   + 4,194,826 wr)
          ==3088== LLd misses:     4,196,772  (     1,985 rd   + 4,194,787 wr)
          ==3088== D1  miss rate:       12.4% (       0.0%     +      99.6%  )
          ==3088== LLd miss rate:       12.4% (       0.0%     +      99.6%  )
          ==3088==
          ==3088== LL refs:        4,197,895  (     3,069 rd   + 4,194,826 wr)
          ==3088== LL misses:      4,197,505  (     2,718 rd   + 4,194,787 wr)
          ==3088== LL miss rate:         4.9% (       0.0%     +      99.6%  )
      Slide annotations: I lines == instructions; D lines == data, with columns for read ops (rd) and write ops (wr); LL lines == lowest level cache
  49. 49. Cache : Good Access
          bill@bill-HP-EliteBook-6930p:~$ valgrind --tool=cachegrind ./foo.x
          ==4410== Cachegrind, a cache and branch-prediction profiler
          ==4410== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
          ==4410== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
          ==4410== Command: ./foo.x
          ==4410==
          ==4410==
          ==4410== I   refs:      50,503,275
          ==4410== I1  misses:           734
          ==4410== LLi misses:           733
          ==4410== I1  miss rate:       0.00%
          ==4410== LLi miss rate:       0.00%
          ==4410==
          ==4410== D   refs:      33,617,678  (29,410,213 rd   + 4,207,465 wr)
          ==4410== D1  misses:       265,002  (     2,335 rd   +   262,667 wr)
          ==4410== LLd misses:       264,613  (     1,985 rd   +   262,628 wr)
          ==4410== D1  miss rate:        0.7% (       0.0%     +       6.2%  )
          ==4410== LLd miss rate:        0.7% (       0.0%     +       6.2%  )
          ==4410==
          ==4410== LL refs:          265,736  (     3,069 rd   +   262,667 wr)
          ==4410== LL misses:        265,346  (     2,718 rd   +   262,628 wr)
          ==4410== LL miss rate:         0.3% (       0.0%     +       6.2%  )
  50. 50. Cache Performance
      ● For large data problems, any speedup introduced by parallelization can easily be negated by poor cache utilization
      ● In this case, memory bandwidth is an order of magnitude worse for problem size (2^14)^2 (cf earlier note on widely variable memory bandwidths; we have to work hard to approach peak)
      ● In many cases we are limited also by random access patterns
      [Figure: time (s), 0 to 12, versus log2 SIZE_X, 10 to 14, with one curve for the high % miss access pattern and one for the low % miss pattern.]
  52. 52. POSIX Threads I
      ● A process may spawn one or more threads; on a multiprocessor, the OS can schedule these threads across a variety of cores, providing parallelism in the form of light-weight processes (LWP)
      ● Whereas a child process receives a copy of the parent's virtual memory and executes independently thereafter, a thread shares the memory of the parent including instructions, and also has private data
      ● Using threads we perform shared memory processing (cf distributed memory, next time)
      ● We are at liberty to launch as many threads as we wish, although as you might expect, performance takes a hit as more threads are launched than can be scheduled simultaneously across available cores
  53. 53. POSIX Threads II
      ● Pthreads refers to the POSIX standard, which is just a specification; implementations exist for various systems
      ● Each pthread has :
          ● An ID
          ● Attributes :
              ● Stack size
              ● Schedule information
      ● Much like processes, we can monitor thread execution using utilities such as top and ps
      ● The memory shared among threads must be used carefully in order to prevent race conditions, or threads seeing incorrect data during execution, due to more than one thread performing operations on said data in an uncoordinated fashion
  54. 54. POSIX Threads III
●Race conditions may be ameliorated through careful coding, but also through explicit constructs eg., locks, whereby a single thread gains and relinquishes control → implies serialization and computational overhead
●Multi-threaded programs must also avoid deadlock, a highly undesirable state in which threads wait for resources held by others, and in turn are unable to release the resources those others require
●Deadlocks can also be avoided through good coding, as well as the use of communication techniques based around semaphores, for example
●Threads awaiting resources may sleep (context switch by kernel, slow, saves cycles) or busy wait (executes while loop or similar checking semaphore, fast, wastes cycles)
  55. 55. Pthreads Example
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;                            /* global (shared) variable */
void *worker(void *param);

int main(int argc, char *argv[])    /* main thread */
{
        pthread_t tid;              /* thread id & attributes */
        pthread_attr_t attr;

        if (argc != 2 || atoi(argv[1]) < 0) {
                printf("usage : a.out <int value>, where int value >= 0\n");
                return -1;
        }
        pthread_attr_init(&attr);
        /* worker thread creation & join after completion */
        pthread_create(&tid, &attr, worker, argv[1]);
        pthread_join(tid, NULL);
        printf("sum = %d\n", sum);
        return 0;
}

void *worker(void *total)
{
        int upper = atoi(total);    /* local (private) variable */
        sum = 0;
        for (int i = 0; i < upper; i++)
                sum += i;
        pthread_exit(0);
}
  56. 56. Valgrind-helgrind output
[wjb19@hammer16 scratch]$ valgrind --tool=helgrind -v ./foo.x 100
==5185== Helgrind, a thread error detector
==5185== Copyright (C) 2007-2009, and GNU GPL'd, by OpenWorks LLP et al.
==5185== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==5185== Command: ./foo.x 100
==5185==
--5185-- Valgrind options:
--5185--    --tool=helgrind
--5185--    -v
--5185-- Contents of /proc/version:
--5185--   Linux version 2.6.18-274.7.1.el5 (mockbuild@x86-004.build.bos.redhat.com) (gcc version
--5185-- REDIR: 0x3a97e7c240 (memcpy) redirected to 0x4a09e3c (memcpy)
--5185-- REDIR: 0x3a97e79420 (index) redirected to 0x4a09bc9 (index)
--5185-- REDIR: 0x3a98a069a0 (pthread_create@@GLIBC_2.2.5) redirected to 0x4a0b2a5 (pthread_create@*)
--5185-- REDIR: 0x3a97e749e0 (calloc) redirected to 0x4a05942 (calloc)
--5185-- REDIR: 0x3a98a08ca0 (pthread_mutex_lock) redirected to 0x4a076c2 (pthread_mutex_lock)
--5185-- REDIR: 0x3a97e74dc0 (malloc) redirected to 0x4a0664a (malloc)
--5185-- REDIR: 0x3a98a0a020 (pthread_mutex_unlock) redirected to 0x4a07b66 (pthread_mutex_unlock)
--5185-- REDIR: 0x3a97e79b50 (strlen) redirected to 0x4a09cbb (strlen)
--5185-- REDIR: 0x3a98a07a10 (pthread_join) redirected to 0x4a07431 (pthread_join)
sum = 4950
==5185==
==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3)
--5185--
--5185-- used_suppression:      1 helgrind-glibc2X-101
--5185-- used_suppression:      1 helgrind-glibc2X-112
--5185-- used_suppression:      1 helgrind-glibc2X-102
==5185==
==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3)
●The REDIR lines show the calls establishing the thread, ie., there is a COST to create and destroy threads
  57. 57. Pthreads: Race Condition
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;
void *worker(void *param);      /* definition not shown on this slide */

int main(int argc, char *argv[])
{
        pthread_t tid;
        pthread_attr_t attr;

        if (argc != 2 || atoi(argv[1]) < 0) {
                printf("usage : a.out <int value>, where int value >= 0\n");
                return -1;
        }
        pthread_attr_init(&attr);
        pthread_create(&tid, &attr, worker, argv[1]);

        /* main thread works on global variable as well,
           without synchronization/coordination */
        int upper = atoi(argv[1]);
        sum = 0;
        for (int i = 0; i < upper; i++)
                sum += i;

        pthread_join(tid, NULL);
        printf("sum = %d\n", sum);
}
  58. 58. Helgrind output w/ race
[wjb19@hammer16 scratch]$ valgrind --tool=helgrind ./foo.x 100
==5384== Helgrind, a thread error detector
==5384== Copyright (C) 2007-2009, and GNU GPL'd, by OpenWorks LLP et al.
==5384== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==5384== Command: ./foo.x 100
==5384==
==5384== Thread #1 is the program's root thread
==5384==
==5384== Thread #2 was created
==5384==    at 0x3A97ED447E: clone (in /lib64/libc-2.5.so)
==5384==    by 0x3A98A06D87: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
==5384==    by 0x4A0B206: pthread_create_WRK (hg_intercepts.c:229)
==5384==    by 0x4A0B2AD: pthread_create@* (hg_intercepts.c:256)
==5384==    by 0x400748: main (fooThread2.c:18)
==5384==
==5384== Possible data race during write of size 4 at 0x600cdc by thread #1
==5384==    at 0x400764: main (fooThread2.c:20)
==5384==  This conflicts with a previous write of size 4 by thread #2
==5384==    at 0x4007E3: worker (fooThread2.c:31)
==5384==    by 0x4A0B330: mythread_wrapper (hg_intercepts.c:201)
==5384==    by 0x3A98A0673C: start_thread (in /lib64/libpthread-2.5.so)
==5384==    by 0x3A97ED44BC: clone (in /lib64/libc-2.5.so)
==5384==
●Built foo.x with debug on (-g) to find source file line(s) w/ error(s)
●Pthreads is a versatile albeit large and inherently complicated interface
●We are primarily concerned with simply dividing a workload among available cores; OpenMP proves much less unwieldy to use
  59. 59. OpenMP Introduction
●OpenMP is a set of multi-platform/OS compiler directives, libraries and environment variables for readily creating multi-threaded applications
●The OpenMP standard is managed by the OpenMP Architecture Review Board, and is defined jointly by a large number of hardware and software vendors
●Applications written using OpenMP employ pragmas, ie., directives interpreted by the compiler (and ignored by compilers without OpenMP support enabled), representing functionality like fork & join that would take considerably more effort and care to implement otherwise
●OpenMP pragmas or directives indicate parallel sections of code, ie., after compilation, at runtime, threads are each given a portion of work; eg., in this case, loop iterations will be divided among running threads (roughly evenly under the usual default schedule) :
#pragma omp parallel for
for (int i=0; i<SIZE; i++)
        y[i]=x[i]*10.0f;
  60. 60. OpenMP Clauses I
●The number of threads launched during parallel blocks may be set via function calls (eg., omp_set_num_threads()) or by setting the OMP_NUM_THREADS environment variable
●Data objects are generally shared by default (loop counters are private by default); a number of pragma clauses are available, which are valid for the scope of the parallel section eg., :
  ● private
  ● shared
  ● firstprivate - private copy initialized to the value held before the parallel block
  ● lastprivate - variable keeps the value from the sequentially last iteration after the parallel block
  ● reduction - thread-safe way of combining data at conclusion of parallel block
●Thread synchronization is implicit at the end of parallel sections; there are a variety of directives available for controlling this behavior also, including :
  ● critical - one thread at a time works in this section eg., in order to avoid a race (expensive; design your code to avoid it wherever possible)
  ● atomic - safe update of a single memory location, performed using eg., mutual exclusion or hardware atomic instructions (cost)
  ● barrier - threads wait at this point for others to arrive
  61. 61. OpenMP Clauses II
●OpenMP has default thread scheduling behavior handled via the runtime library, which may be modified through use of the schedule(type,chunk) clause, with types :
  ● static - loop iterations are divided among threads equally by default; specifying an integer for the parameter chunk will allocate a number of contiguous iterations to a thread
  ● dynamic - total iterations form a pool, from which threads work on small contiguous subsets until all are complete, with subset size given again by chunk
  ● guided - a large section of contiguous iterations is allocated to each thread dynamically. The section size decreases exponentially with each successive allocation, down to a minimum size specified by chunk
  62. 62. OpenMP Example : KTM
●In our first attempt at parallelization, we simply add an OpenMP pragma before the computational loops in the worker function:
#pragma omp parallel for
//loop over trace records
for (int k=0; k<config->traceNo; k++){
    //loop over imageX
    for(int i=0; i<Li; i++){
        tempC = ( midX[k] - imageXX[i]-offX[k]) * (midX[k]- imageXX[i]-offX[k]);
        tempD = ( midX[k] - imageXX[i]+offX[k]) * (midX[k]- imageXX[i]+offX[k]);
        //loop over imageY
        for(int j=0; j<Lj; j++){
            tempA = tempC + ( midY[k] - imageYY[j]-offY[k]) * (midY[k]- imageYY[j]-offY[k]);
            tempB = tempD + ( midY[k] - imageYY[j]+offY[k]) * (midY[k]- imageYY[j]+offY[k]);
            //loop over imageZ
            for (int l=0; l<Ll; l++){
                temp = sqrtf(tauS[l] + tempA * slownessS[l]);
                temp += sqrtf(tauS[l] + tempB * slownessS[l]);
                timeIndex = (int) (temp / sRate);
                if ((timeIndex < config->tracePts) && (timeIndex > 0)){
                    image[i*Lj*Ll + j*Ll + l] += traces[timeIndex + k * config->tracePts] * temp *sqrtf(tauS[l] / temp);
                }
            } //imageZ
        } //imageY
    } //imageX
}//input trace records
●Caution: the declarations of tempA..tempD, temp and timeIndex are not shown here; if they sit outside the parallel loop they are shared by default and need a private(...) clause. The += into image[] is also shared across k iterations, so it needs eg., an atomic update or a per-thread copy of the image to be free of races
  63. 63. OpenMP KTM Results
●Scales well up to eight cores, then drops off; the SMP model has deficiencies due to a number of factors, including :
  ● Coverage (Amdahl's law); as we increase processors, relative cost of serial code portion increases
  ● Hardware limitations
  ● Locality...
[Figure: execution time versus CPU cores (1, 2, 4, 8, 16)]
  64. 64. CPU Affinity (Intel*)
●Recall that the OS schedules processes and threads using context switches; this can be detrimental → threads may resume on a different core, destroying locality
●We can change this by restricting threads to execute on a subset of processors, by setting processor affinity
●Simplest approach is to set environment variable KMP_AFFINITY to:
  ● determine the machine topology,
  ● assign threads to processors
●Usage: KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]
*For GNU, ~ equivalent env var == GOMP_CPU_AFFINITY
  65. 65. CPU Affinity Settings
●The modifier may take settings corresponding to granularity (with specifiers: fine, thread, and core), as well as a processor list (proclist={<proc-list>}), verbose, warnings and others
●The type settings refer to the nature of the affinity, and may take values :
  ● compact - try to assign thread n+1 context as close as possible to n
  ● disabled
  ● explicit - force assignment of threads to processors in proclist
  ● none - do not bind threads; just return the topology w/ verbose modifier
  ● scatter - distribute as evenly as possible
●fine & thread refer to the same thing, namely that threads only resume in the same context; the core modifier implies that they may resume within a different context, but the same physical core
●CPU affinity can affect application performance significantly and is worth tuning, based on your application and the machine topology...
  66. 66. CPU Topology Map
●For any given computational node, we have several different physical devices (packages in sockets), comprised of cores (eg., two here), which run one or two thread contexts
●Without hyperthreading, there is only a single context per core ie., modifiers thread/fine, core are indistinguishable
[Figure: topology tree. Node → packageA, packageB; each package → core0, core1; each core → thread contexts 0, 1]
  67. 67. CPU Affinity Examples
●Display machine topology map eg., Hammer :
[wjb19@hammer16 scratch] $ export KMP_AFFINITY=verbose,none
[wjb19@hammer16 scratch] $ ./psktm.x
OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #156: KMP_AFFINITY: 12 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
  68. 68. CPU Affinity Examples
●Set affinity with compact setting, fine granularity :
[wjb19@hammer5 scratch]$ export KMP_AFFINITY=verbose,granularity=fine,compact
[wjb19@hammer5 scratch]$ ./psktm.x
OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #156: KMP_AFFINITY: 12 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 10
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 10
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {10}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {6}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {9}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {5}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {3}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {11}
  69. 69. Conclusions
●Scientific research is supported by computational scaling and performance, both provided by parallelism, limited to some extent by Amdahl's law
●Parallelism has various levels of granularity; at the finest level is the instruction pipeline and vectorized registers eg., SSE
●The next level up in parallel granularity is the multiprocessor; we may run many concurrent threads using the pthreads API or the OpenMP standard for instance
●Threads must be coded and handled with care, to avoid race and deadlock conditions
●Performance is a strong function of cache utilization; benefits introduced through parallelization can easily be negated by sloppy use of memory bandwidth
●Scaling across cores is limited by hardware and Amdahl's law, but also by locality; we have some control over the latter using KMP_AFFINITY for instance
  70. 70. References
●Valgrind (buy the manual, worth every penny)
  ● http://valgrind.org/
●OpenMP
  ● http://openmp.org/wp/
●GNU OpenMP
  ● http://gcc.gnu.org/projects/gomp/
●Summary of OpenMP 3.0 C/C++ Syntax
  ● http://openmp.org/mp-documents/OpenMP3.1-CCard.pdf
●Summary of OpenMP 3.0 Fortran Syntax
  ● http://www.openmp.org/mp-documents/OpenMP3.0-FortranCard.pdf
●Nice SSE tutorial
  ● http://neilkemp.us/src/sse_tutorial/sse_tutorial.html
●Intel Nehalem
  ● http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29
●GNU Make
  ● http://www.gnu.org/s/make/
●Intel hyperthreading
  ● http://en.wikipedia.org/wiki/Hyper-threading
  71. 71. Exercises
●Take the supplied code and parallelize it using an OpenMP pragma before the computational loops in the worker function
●Create a makefile which builds the code, and compare timings between serial & parallel by varying OMP_NUM_THREADS
●Examine the effect of various settings for KMP_AFFINITY
  72. 72. Build w/ Confidence : make
#Makefile for basic Kirchhoff Time Migration example

#set compiler
CC=icc -openmp

#set build options
CFLAGS=-std=c99 -c

#main executable
all: psktm.x

#objects and dependencies
psktm.x: psktmCPU.o demoA.o
	$(CC) psktmCPU.o demoA.o -o psktm.x

psktmCPU.o: psktmCPU.c
	$(CC) $(CFLAGS) psktmCPU.c

demoA.o: demoA.c
	$(CC) $(CFLAGS) demoA.c

#remove objects and executable
clean:
	rm -rf *.o psktm.x

●Indent recipe lines with a tab only!
  73. 73. HPC Essentials Part III : Message Passing Interface
Bill Brouwer, Research Computing and Cyberinfrastructure (RCC), PSU
wjb19@psu.edu
  74. 74. Outline●Motivation●Interprocess Communication ● Signals ● Sockets & Networks●procfs Digression●Message Passing Interface ● Send/Receive ● Communication ● Parallel Constructs ● Grouping Data ● Communicators & Topologies wjb19@psu.edu
  75. 75. Motivation
●We saw last time that Amdahl's law implies an asymptotic limit to performance gains from parallelism, where the parallel (P) and serial (1-P) code portions have fixed relative cost
●We looked at threads ("light-weight processes") and also saw that performance depends on a variety of things, including good cache utilization and affinity
●For the problem size investigated, the limiting factor was ultimately disk I/O, so there was no sense going beyond a single compute node; in a machine with 16 cores or more, there is no point in adding nodes when P < 60%, should the process have sufficient memory
●However, as we increase our problem size, the relative parallel/serial cost changes and P can approach 1
  76. 76. Motivation
●Amdahl's law gives the speedup on N processors as $S(N) = 1/((1-P) + P/N)$; in the limit as $N \to \infty$ we find the maximum performance improvement : $S_{max} = 1/(1-P)$
●It is helpful to see the 3dB points for this limit ie., the number of processors $N_{1/2}$ required to achieve $S_{max}/\sqrt{2} = 1/(\sqrt{2}(1-P))$; equating with Amdahl's law & after some algebra : $N_{1/2} = P/((1-P)(\sqrt{2}-1))$, which for P close to 1 is approximately $1/((1-P)(\sqrt{2}-1))$
[Figure: $N_{1/2}$ (0 to 300) versus parallel code fraction P (0.9 to 0.99)]
  77. 77. Motivation
●Points to note from the graph :
  ● P ~ 0.90, we can benefit from ~ 20 cores
  ● P ~ 0.99, we can benefit from a cluster size of ~ 256 cores
  ● P → 1, we approach the "embarrassingly parallel" limit
  ● P ~ 1, performance improvement directly proportional to cores
  ● P ~ 1 implies independent or batch processes
●Quite aside from considerations of Amdahl's law, as the problem size grows, we may simply exceed the memory available on a single node
●In this case, we must move to a distributed memory processing model/multiple nodes (unless P ~ 1 of course)
●How do we determine P? → PROFILING
  78. 78. Profiling w/ Valgrind
[wjb19@lionxf scratch]$ valgrind --tool=callgrind ./psktm.x
[wjb19@lionxf scratch]$ callgrind_annotate --inclusive=yes callgrind.out.3853
--------------------------------------------------------------------------------
Profile data file callgrind.out.3853 (creator: callgrind-3.5.0)
--------------------------------------------------------------------------------
I1 cache:
D1 cache:
L2 cache:
Timerange: Basic block 0 - 2628034011
Trigger: Program termination
Profiled target:  ./psktm.x (PID 3853, part 1)
--------------------------------------------------------------------------------
20,043,133,545  PROGRAM TOTALS
--------------------------------------------------------------------------------
            Ir  file:function
--------------------------------------------------------------------------------
20,043,133,545  ???:0x0000003128400a70 [/lib64/ld-2.5.so]
20,042,523,959  ???:0x0000000000401330 [/gpfs/scratch/wjb19/psktm.x]
20,042,522,144  ???:(below main) [/lib64/libc-2.5.so]
20,042,473,687  /gpfs/scratch/wjb19/demoA.c:main
20,042,473,687  demoA.c:main [/gpfs/scratch/wjb19/psktm.x]
19,934,044,644  psktmCPU.c:ktmMigrationCPU [/gpfs/scratch/wjb19/psktm.x]
19,934,044,644  /gpfs/scratch/wjb19/psktmCPU.c:ktmMigrationCPU
 6,359,083,826  ???:sqrtf [/gpfs/scratch/wjb19/psktm.x]
 4,402,442,574  ???:sqrtf.L [/gpfs/scratch/wjb19/psktm.x]
   104,966,265  demoA.c:fileSizeFourBytes [/gpfs/scratch/wjb19/psktm.x]
●Parallelizable worker function (ktmMigrationCPU) is 99.5% of total instructions executed
●If we wish to scale outside a single node, we must use some form of interprocess communication
  79. 79. Inter-Process Communication
●There are a variety of ways for processes to exchange information, including:
  ● Memory (~last week)
  ● Files
  ● Pipes (named/anonymous)
  ● Signals
  ● Sockets
  ● Message Passing
●File I/O is too slow, and reads/writes are liable to race conditions
●Anonymous & named pipes are highly efficient, but they are FIFO (first in, first out) buffers, allowing only unidirectional communication, and only between processes on the same node
●Signals are a very limited form of communication, delivered to the process by the kernel (eg., following an interrupt), and handled using a default handler or one specified using the signal() system call
●Signals may come from a variety of sources eg., segmentation fault (SIGSEGV), keyboard interrupt Ctrl-C (SIGINT) etc
  80. 80. Signals
●strace is a powerful utility in UNIX which shows the interaction between a running process and kernel in the form of system calls and signals; here, a partial output showing mapping of signals to defaults with system call sigaction(), from ./psktm.x (first argument of each call: the UNIX signal) :
rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGILL, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGABRT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGFPE, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGBUS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSEGV, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSYS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_DFL, [], 0}, 8) = 0
●Signals are crude and restricted to local communication; to communicate remotely, we can establish a socket between processes, and communicate over the network
  81. 81. Sockets & Networks
●Davies/Baran first devised packet switching, an efficient means of communication over a channel; a computer was conceived to realize their design and ARPANET went online Oct 1969 between UCLA and the Stanford Research Institute (SRI)
●TCP/IP became the communication protocol of ARPANET on 1 Jan 1983; ARPANET was retired in 1990, its role taken over by NSFNET, which university networks in the US and Europe joined
●TCP/IP is just one of many protocols, which describe the format of data packets and the nature of the communication; an analogous connection method is used by Infiniband networks in conjunction with Remote Direct Memory Access (RDMA)
●User Datagram Protocol (UDP), an unreliable, connectionless protocol, is analogous to a connectionless method of communication used by Infiniband high performance networks
  82. 82. Sockets : UDP host example
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <unistd.h> /* for close() for socket */
#include <stdlib.h>

int main(void)
{
  //creates an endpoint & returns file descriptor
  //uses IPv4 domain, datagram type, UDP transport
  int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);

  //socket address object (sa) and memory buffer
  struct sockaddr_in sa;
  char buffer[1024];
  ssize_t recsize;
  socklen_t fromlen;

  //specify same domain type, any input address and port 7654 to listen on
  memset(&sa, 0, sizeof sa);
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = INADDR_ANY;
  sa.sin_port = htons(7654);
  fromlen = sizeof(sa);
  83. 83. Sockets : host example cont.
  //we bind an address (sa) to the socket using fd sock
  if (-1 == bind(sock,(struct sockaddr *)&sa, sizeof(sa)))
  {
    perror("error bind failed");
    close(sock);
    exit(EXIT_FAILURE);
  }

  for (;;)
  {
    //listen and dump buffer to stdout where applicable
    printf ("recv test....\n");
    recsize = recvfrom(sock, (void *)buffer, 1024, 0, (struct sockaddr *)&sa, &fromlen);
    if (recsize < 0) {
      fprintf(stderr, "%s\n", strerror(errno));
      exit(EXIT_FAILURE);
    }
    printf("recsize: %zd\n ", recsize);
    sleep(1);
    printf("datagram: %.*s\n", (int)recsize, buffer);
  }
}
  84. 84. Sockets : client example
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/inet.h> /* for inet_addr() */
#include <unistd.h>

int main(int argc, char *argv[])
{
  //create a buffer with character data
  int sock;
  struct sockaddr_in sa;
  int bytes_sent;
  char buffer[200];

  strcpy(buffer, "hello world!");

  //create a socket, same IP and transport as before, address of host
  sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
  if (-1 == sock) /* if socket failed to initialize, exit */
  {
    printf("Error Creating Socket");
    exit(EXIT_FAILURE);
  }

  memset(&sa, 0, sizeof sa);
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = inet_addr(""); /* host address is blank on the slide */
  sa.sin_port = htons(7654);

  bytes_sent = sendto(sock, buffer, strlen(buffer), 0,(struct sockaddr*)&sa, sizeof sa);
  if (bytes_sent < 0) {
    printf("Error sending packet: %s\n", strerror(errno));
    exit(EXIT_FAILURE);
  }

  close(sock); /* close the socket */
  return 0;
}
●You can monitor sockets by using the netstat facility, which takes its data from /proc/net
  86. 86. procfs
●We mentioned the /proc directory previously in the context of cpu and memory information; it is frequently referred to as the proc filesystem or procfs
●It is a veritable treasure trove of information, generated on demand by the kernel whenever a file is read, and is used by a variety of tools eg., ps
●Each running process is assigned a directory, whose name is the process id
●Each directory contains text files and subdirectories with every detail of a running process, including context switching statistics, memory management, open file descriptors and much more
●Much like the ptrace() system call, procfs also gives user applications the ability to directly manipulate running processes, given sufficient permission; you can explore that on your own :)
  87. 87. procfs : examples
●Some of the more useful files :
  ● /proc/PID/cmdline : command used to launch process
  ● /proc/PID/cwd : current working directory
  ● /proc/PID/environ : environment variables for the process
  ● /proc/PID/fd : directory w/ symbolic link for each open file descriptor eg., streams
  ● /proc/PID/status : information including signals, state, memory usage
  ● /proc/PID/maps : map of the process's virtual address space (mapped regions, permissions, backing files)
●eg., contents of the fd directory for running process ./psktm.x :
[wjb19@hammer1 fd]$ ls -lah
total 0
dr-x------ 2 wjb19 wjb19  0 Dec  7 12:13 .
dr-xr-xr-x 6 wjb19 wjb19  0 Dec  7 12:10 ..
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 0 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 1 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 2 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 3 -> /gpfs/scratch/wjb19/inputDataSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 4 -> /gpfs/scratch/wjb19/inputSrcXSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 5 -> /gpfs/scratch/wjb19/inputSrcYSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 6 -> /gpfs/scratch/wjb19/inputRecXSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 7 -> /gpfs/scratch/wjb19/inputRecYSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 8 -> /gpfs/scratch/wjb19/velModel.bin
  88. 88. procfs : status file extract
[wjb19@hammer1 30769]$ more status
Name: psktm.x
State: R (running)
SleepAVG: 0%
Tgid: 30769
Pid: 30769
PPid: 30687
TracerPid: 0
Uid: 2511 2511 2511 2511
Gid: 2530 2530 2530 2530
FDSize: 256
Groups: 2472 2530 3835 4933 5505 5732
VmPeak:    65520 kB
VmSize:    65520 kB
VmLck:        0 kB
VmHWM:    37016 kB
VmRSS:    37016 kB
VmData:    51072 kB
VmStk:       88 kB
VmExe:       64 kB
VmLib:     2944 kB
VmPTE:      164 kB
StaBrk: 1289a000 kB
Brk: 128bb000 kB
StaStk: 7fffbd0a0300 kB
Threads: 5
SigQ: 0/398335
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000180000000
●Vm* entries : virtual memory usage
●Sig* entries : signals
  90. 90. Message Passing Interface (MPI)
●Classical von Neumann machine has single instruction/data stream (SISD) → single process & memory
●Multiple instruction, multiple data (MIMD) system → connected processes are asynchronous, generally distributed memory (may also be shared where processes are on a single node)
●MIMD processors are connected in some network topology; we don't have to worry about the details, MPI abstracts this away
●MPI is a standard for parallel programming, developed by academics and industry from 1991 onwards (version 1.0 appeared in 1994) and updated occasionally
●It comprises routines for point-to-point and collective communication, with bindings to C/C++ and Fortran
●Depending on the underlying network fabric, communication may be TCP- or UDP-like in Infiniband networks
  91. 91. MPI : Basic communication
●Multiple, distributed processes are spawned at initialization, each process assigned a unique rank 0,1,...,p-1
●One may send information referencing process rank eg.,:
MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
  (first argument &x : buffer address; fourth argument 1 : rank of receiver)
●This function has a receive analogue (MPI_Recv); both routines are blocking by default
●Send/receive statements generally occur in the same code; processors execute the appropriate statement according to rank & code branch
●Non-blocking functions (eg., MPI_Isend, MPI_Irecv) are available, allowing communicating processes to continue with execution where able
  92. 92. MPI : Requisite functions
●Bare minimum → initialize, get rank for process, total processes and finalize when done
MPI_Init(&argc, &argv);                    //Start up
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   //My rank
MPI_Comm_size(MPI_COMM_WORLD, &p);         //No. processes
MPI_Finalize();                            //close up shop
●MPI_COMM_WORLD is a communicator parameter, a collection of processes that can send messages to each other
●Messages are sent with tags to identify them, allowing specificity beyond using just a source/destination parameter
  93. 93. MPI : Datatypes
MPI datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE
MPI_PACKED
  94. 94. Minimal MPI example
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
        int rank, size;
        int buffer[10];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank > 0) {
                //every non-root rank fills its buffer and sends it to rank 0
                for (int i = 0; i < 10; i++)
                        buffer[i] = i * rank;
                MPI_Send(buffer, 10, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else {
                //rank 0 receives from each of the other ranks in turn
                for (int i = 1; i < size; i++) {
                        MPI_Recv(buffer, 10, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
                        printf("buffer element 0 : %i from proc : %i \n", buffer[0], i);
                }
        }
        MPI_Finalize();
        return 0;
}
  95. 95. MPI : Collective Communication
●A communication pattern involving all processes in a communicator is a collective communication eg., a broadcast
●Same data sent to every process in communicator; more efficient than using multiple p2p routines, optimized :
MPI_Bcast(void* message, int count, MPI_Datatype type, int root, MPI_Comm comm)
●Sends a copy of the data in message from the root process to all processes in comm, a one-to-all operation (unlike a scatter, every process receives the same data)
●Collective communication is at the heart of efficient parallel operations
  96. Parallel Operations : Reduction
    ● Data may be gathered/reduced after computation via :
        MPI_Reduce(void* operand, void* result, int count, MPI_Datatype type, MPI_Op operator, int root, MPI_Comm comm)
    ● Combines the operand from every process using operator, and stores the result on process root, in result
    ● A tree-structured reduce at all nodes == MPI_Allreduce, ie., every process in comm gets a copy of the result
    [Figure: reduction tree from processes 1, 2, 3, ..., p-1 to root process 0]
  97. Reduction Ops
        Operator      Meaning
        MPI_MAX       Maximum
        MPI_MIN       Minimum
        MPI_SUM       Sum
        MPI_PROD      Product
        MPI_LAND      Logical and
        MPI_BAND      Bitwise and
        MPI_LOR       Logical or
        MPI_BOR       Bitwise or
        MPI_LXOR      Logical XOR
        MPI_BXOR      Bitwise XOR
        MPI_MAXLOC    Max w/ location
        MPI_MINLOC    Min w/ location
  98. Parallel Operations : Scatter/Gather
    ● Bulk transfers of many-to-one and one-to-many are accomplished by gather and scatter operations respectively
    ● These operations form the kernel of matrix/vector operations, for example; they are useful for distributing and reassembling arrays
    [Figure: vector elements x0, x1, x2, x3 on processes 0 to 3 and matrix row a00 a01 a02 a03, with gather and scatter directions]
  99. Scatter/Gather Syntax
    ● MPI_Gather(void* send_data, int send_count, MPI_Datatype send_type, void* recv_data, int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)
    ● Collects the data referenced by send_data from each process in comm and stores it, in process rank order, on the process w/ rank root, in memory referenced by recv_data
    ● MPI_Scatter(void* send_data, int send_count, MPI_Datatype send_type, void* recv_data, int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)
    ● Splits the data referenced by send_data on the process w/ rank root into segments of send_count elements of send_type each, and distributes the segments in rank order to the processes in comm
    ● For gather result to ALL processes → MPI_Allgather
  100. Grouping Data I
    ● Communication is expensive → bundle variables into a single message
    ● We must define a derived type that can describe the heterogeneous contents of a message using type and displacement pairs
    ● Several ways to build this MPI_Datatype, eg., MPI_Type_struct (named MPI_Type_create_struct since MPI-2) :
        MPI_Type_struct(
            int count,                //no. blocks
            int block_lengths[],      //contains no. entries in each block
            MPI_Aint displacements[], //element offset from msg start; MPI_Aint allows for addresses > int
            MPI_Datatype typelist[],  //exactly that
            MPI_Datatype* new_mpi_t   //a pointer to this new type
        )
    ● A very general derived type, although the arrays passed to MPI_Type_struct must be constructed explicitly using other MPI commands
    ● Simpler when less heterogeneous, eg., MPI_Type_vector, MPI_Type_contiguous, MPI_Type_indexed
  101. Grouping Data II
    ● Before these derived types can be used by a communication function, they must be committed with a call to MPI_Type_commit
    ● In order for a message to be received, the type signatures at send and receive must be compatible; for a collective communication, the signatures must be identical
    ● MPI_Pack & MPI_Unpack are useful when messages of heterogeneous data are infrequent, and the cost of constructing a derived type outweighs the benefit
    ● These methods also allow buffering in user versus system memory, and the number of items transmitted can be carried in the message itself
    ● Grouping data allows for sophisticated objects; we can also create more fine-grained communication objects
  102. Communicators
    ● Process subsets or groups expand communication beyond simple p2p and broadcast communication, to create :
        ● Intra-communicators → processes communicate among one another and participate in collective communication; composed of :
            – an ordered collection of processes (group)
            – a context
        ● Inter-communicators → communicate between different groups
    ● Communicators/groups are opaque, internals not directly accessible; these objects are referenced by a handle
  103. Communicators Cont.
    ● Internal contents manipulated by methods, much like private data in C++ class objects, eg.,
        ● int MPI_Group_incl(MPI_Group old_group, int new_group_size, int ranks_in_old_group[], MPI_Group* new_group) → create a new_group from old_group, using ranks_in_old_group[] etc
        ● int MPI_Comm_create(MPI_Comm old_comm, MPI_Group new_group, MPI_Comm* new_comm) → create a new communicator from the old, with context
    ● MPI_Comm_group and MPI_Group_incl are local methods without communication; MPI_Comm_create is a collective communication implying synchronization, ie., to establish a single context
    ● Multiple communicators may be created simultaneously using MPI_Comm_split
  104. Topologies I
    ● MPI allows one to associate different addressing schemes with processes within a group
    ● This is a virtual, as opposed to real or physical, topology, and is either a graph structure or a (Cartesian) grid; grid properties :
        ● Dimensions, w/
            – Size of each
            – Periodicity of each (wrap-around or not)
        ● Option to have processes reordered optimally within grid
    ● Method to establish Cartesian grid cart_comm :
        int MPI_Cart_create(MPI_Comm old_comm, int number_of_dims, int dim_sizes[], int wrap_around[], int reorder, MPI_Comm* cart_comm)
    ● old_comm is typically just MPI_COMM_WORLD created at init
  105. Topologies II
    ● cart_comm will contain the processes from old_comm with associated coordinates, available from MPI_Cart_coords :
        int coordinates[2];
        int my_grid_rank;
        MPI_Comm_rank(cart_comm, &my_grid_rank);
        MPI_Cart_coords(cart_comm, my_grid_rank, 2, coordinates);
    ● Call to MPI_Comm_rank is necessary because of process rank reordering (optimization)
    ● Processes in cart_comm are stored in row major order
    ● Can also partition into sub-grid(s) using MPI_Cart_sub, eg., for a row :
        int free_coords[2];
        MPI_Comm row_comm;   //new sub-grid
        free_coords[0] = 0;  //bool; first coordinate fixed
        free_coords[1] = 1;  //bool; second coordinate free
        MPI_Cart_sub(cart_comm, free_coords, &row_comm);