Multicore – Birgit Plötzeneder, 11/24/10
Intro (Why?) – Architecture – Languages: OMP, MPI – Tools
Darling, I shrunk the computer. *
* copyright by Prof. Erik Hagersten / Uppsala, who does awesome work
Signal propagation delay » transistor delay. Not enough ILP for more transistors. Power consumption.
O RLY? You want FASTER code. NOW.
- prefetching
- high computational load
- image/video
- fun
 
Intel Core 2 Quad
AMD Shanghai (K10)
Intel Dunnington (Xeon 74xx)
Intel i7
AMD Magny-Cours
The Secret...
Moving from 1 core to 4 cores can give you a factor of …
Moving from memory to L1 can give you a factor of …
Disabling the L2 cache will reduce system performance more than disabling a second CPU core of a dual-core processor.

* see Iris Christadler, LRZ
OMP and MPI
OpenMP concept: at program start, only the master thread runs. In a parallel region, a team of worker threads is generated ("fork"). Threads synchronize when leaving the parallel region ("join").
A First Program
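The slide's program itself is not preserved in this transcript. A minimal sketch of a first OpenMP program in C (built with, e.g., gcc -fopenmp) could look like this:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* fork: a team of threads executes the parallel region */
        #pragma omp parallel
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        /* join: the team has synchronized; only the master continues */
        return 0;
    }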
Work-sharing constructs: omp for (omp do in Fortran), sections, single, master
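A sketch (not the original slide code) showing for, single and master inside one parallel region; sections, not shown, would similarly split independent blocks among threads:

    #include <stdio.h>
    #define N 1000

    int main(void) {
        double a[N];
        #pragma omp parallel
        {
            #pragma omp single      /* executed by exactly one thread */
            printf("starting the loop\n");

            #pragma omp for         /* iterations are divided among the team */
            for (int i = 0; i < N; i++)
                a[i] = 2.0 * i;

            #pragma omp master      /* executed only by the master thread */
            printf("a[1] = %f\n", a[1]);
        }
        return 0;
    }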
Data-sharing attribute clauses:
shared: visible and accessible by all threads simultaneously; the default (but not the loop index i!). Watch out for dependencies like a[i] = a[i-1]...
private: each thread has its own local copy; the value is not maintained for use outside
firstprivate: like private, except initialized to the original value
lastprivate: like private, except the original value is updated after the construct
reduction: per-thread copies are combined with a reduction operator (-> reduction ops)
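A sketch exercising the clauses in one loop; the printed values follow from this code, not from the slides:

    #include <stdio.h>

    int main(void) {
        int i, tmp = 0, offset = 100, last = 0, sum = 0;

        /* tmp:    uninitialized private copy per thread
           offset: private copy initialized to 100 (firstprivate)
           last:   value of the sequentially last iteration is copied back
           sum:    per-thread partial sums combined with + (reduction) */
        #pragma omp parallel for private(tmp) firstprivate(offset) \
                lastprivate(last) reduction(+:sum)
        for (i = 0; i < 8; i++) {
            tmp = i + offset;
            last = tmp;
            sum += i;
        }
        printf("last = %d, sum = %d\n", last, sum);  /* 107 and 28 */
        return 0;
    }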
Scheduling clauses: schedule(type, chunk), with type = static, dynamic, or guided
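Roughly how the clause is used; work() is a hypothetical stand-in for a loop body whose cost varies per iteration (build with -fopenmp -lm):

    #include <math.h>
    #define N 10000

    double result[N];

    /* hypothetical stand-in: cost depends on i */
    static double work(int i) {
        double s = 0.0;
        for (int k = 0; k < i % 100; k++)
            s += sin((double)k);
        return s;
    }

    int main(void) {
        /* dynamic: chunks of 8 iterations are handed out as threads
           become free (good for uneven work); static would assign
           chunks up front; guided starts large and shrinks the chunks */
        #pragma omp parallel for schedule(dynamic, 8)
        for (int i = 0; i < N; i++)
            result[i] = work(i);
        return 0;
    }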
Other clauses:
critical: executed by only one thread at a time
atomic: similar to a critical section, but may perform better
ordered: executed in the order in which iterations would be executed in a sequential loop
barrier, nowait
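A sketch contrasting critical and atomic; the per-iteration computation is just a placeholder:

    #include <stdio.h>
    #define N 100

    int main(void) {
        int best = 0, total = 0;
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            int v = (i * 37) % 101;   /* placeholder for real work */

            #pragma omp critical      /* whole statement, one thread at a time */
            if (v > best) best = v;

            #pragma omp atomic        /* single memory update, often cheaper */
            total += v;
        }
        printf("best = %d, total = %d\n", best, total);
        return 0;
    }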
Using clauses
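The slide's example is not preserved here; a sketch of how nowait interacts with the implicit barriers at the end of work-sharing constructs:

    #include <stdio.h>
    #define N 1000

    int main(void) {
        double a[N], b[N], c[N];
        #pragma omp parallel
        {
            #pragma omp for nowait   /* drop the implicit barrier: the next
                                        loop does not read a[], so threads
                                        may move on immediately */
            for (int i = 0; i < N; i++)
                a[i] = i * 0.5;

            #pragma omp for          /* implicit barrier at the end */
            for (int i = 0; i < N; i++)
                b[i] = i * 2.0;

            /* safe: each thread finished its a[] chunk before its b[]
               chunk, and the barrier above waited for all threads */
            #pragma omp for
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }
        printf("c[10] = %f\n", c[10]);
        return 0;
    }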
 
MPI concept: mpicc <options> prog.c; mpirun -arch <architecture> -np <np> prog
MPI
MPI program: 6 basic calls: MPI_INIT, MPI_COMM_RANK, MPI_COMM_SIZE, MPI_SEND, MPI_RECV, MPI_FINALIZE.
MPI messages: data (startbuf, count, datatype) plus an envelope (destination/source, tag, communicator).
Communicators define which processes may exchange messages.
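A minimal sketch using exactly these six calls; run with at least two processes (mpirun -np 2 ./prog):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size, msg = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);                  /* 1 */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* 2: who am I?       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* 3: how many of us? */

        if (rank == 0)
            /* data: &msg, 1, MPI_INT; envelope: dest 1, tag 0, communicator */
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);           /* 4 */
        else if (rank == 1) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);  /* 5 */
            printf("rank 1 of %d received %d\n", size, msg);
        }

        MPI_Finalize();                          /* 6 */
        return 0;
    }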
Communication modes: collective vs. point-to-point; One2All, All2All, All2One; blocking / non-blocking; synchronous / asynchronous
Communication modes:
synchronous mode ("safest"): is the receiver ready?
ready mode (lowest system overhead): only if a receiver is already waiting (streaming)
buffered mode (decouples sender from receiver): mind buffer size and buffer attachment!
standard mode
Communication mode    Blocking routine    Non-blocking routine
synchronous           MPI_SSEND           MPI_ISSEND
ready                 MPI_RSEND           MPI_IRSEND
buffered              MPI_BSEND           MPI_IBSEND
standard              MPI_SEND            MPI_ISEND
(receive)             MPI_RECV            MPI_IRECV
(combined)            MPI_SENDRECV, MPI_SENDRECV_REPLACE
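A sketch of the non-blocking standard-mode pair; the buffer must not be touched between MPI_ISEND/MPI_IRECV and the matching MPI_WAIT:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, out = 1, in = 0;
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            /* ... useful work can overlap the communication here ... */
            MPI_Wait(&req, &status);   /* out may be reused only after this */
        } else if (rank == 1) {
            MPI_Irecv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, &status);
            printf("received %d\n", in);
        }

        MPI_Finalize();
        return 0;
    }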
Collective communication: Barrier, Broadcast, Gather, Scatter, Reduction
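A sketch combining two of these, a broadcast followed by a reduction:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size, n = 0, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) n = 10;
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* root 0 -> all */

        int mine = rank * n;   /* each rank's local contribution */
        MPI_Reduce(&mine, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %d\n", size, sum);
        MPI_Finalize();
        return 0;
    }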
gprof, valgrind, PAPI
PAPI is a library that monitors hardware events while a program runs. Papiex is a tool that makes it easy to access performance counters using PAPI.*
* http://icl.cs.utk.edu/papi/
papiex -e <EVENT> ./my_prog (for some tests, turn off optimizations with the flag -O0)
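A sketch using PAPI's classic high-level counter interface (the API of the PAPI releases this talk dates from; later versions replaced it). Which events exist depends on the CPU:

    #include <stdio.h>
    #include <papi.h>

    int main(void) {
        int events[2] = { PAPI_TOT_CYC, PAPI_L2_DCM };  /* cycles, L2 data misses */
        long long counts[2];

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        /* ... the code under measurement ... */
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; i++)
            x += i * 0.5;

        if (PAPI_stop_counters(counts, 2) != PAPI_OK)
            return 1;
        printf("cycles: %lld, L2 data misses: %lld\n", counts[0], counts[1]);
        return 0;
    }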
Profilers: two types, statistical profilers and event-based profilers.
Statistical profiling: interrupts at random intervals and records which program instruction the CPU is executing.
Event-based profiling: interrupts triggered by hardware counter events are recorded.
Measuring profiles affects performance, and a lot of data still gets saved.
Tracing: wrappers for function calls (for example MPI_Recv) record when a function was called and with what parameters, and which nodes exchanged messages and the message sizes.
Can affect performance.
Intel tracing tools
Marmot: MPI correctness and portability checker
MpiP: http://mpip.sourceforge.net/
Extrae + Paraver: module add paraver; mpi2prv -f TRACE.mpits -o MPImatrix.prv
Scalasca
Screenshots and examples of profilers/tracing tools are available – but not on the internet.
This talk was given to the TumFUG Linux/Unix user group at the TU München. Contact me via [email_address]. You may use the pictures of the processors (not the screenshots, not the overview picture, which I only adapted), but please notify and credit me accordingly. Some of the code was copy-pasted from Wikipedia; I've removed copyright-problematic parts.
