Medical image processing strategies for multi-core CPUs
Daniel Blezek, Mayo Clinic
blezek.daniel@mayo.edu
Poll
Does your primary computer have more than one core...?
Have you ever written parallel code?
It’s a parallel world...
SMP formerly was the domain of researchers
Thanks to Intel, now it’s everywhere!
... but most of us think in serial ...
Hardware has far outstripped software
Developers are not trained
Development of parallel software is difficult
Outside the box: new parallel languages such as Erlang and Scala
... or shoehorn parallelism into the languages we already use
Parallel Computing – according to Google
“parallel computing” 1.4M hits on Google
“multithreading” 10M hits
“multicore” 2.4M hits
“parallel programming” 1.1M hits
Why is it so hard?
  the world is parallel
  we all think in parallel
  yet we are taught to program in serial
Degrees of parallelism (my take)
Serial – SISD, single thread of execution
Data parallel – SIMD (fine-grained parallelism)
Embarrassingly parallel – larger scale SIMD
  CT or MR reconstruction: each operation is independent, e.g. iFFT of slices
  Worker thread – e.g. virus scanning software
Coarse-grained parallelism – SMP or MIMD
  Focus of this presentation, more in GPU talk
  Concurrency, OpenMP, TBB, pthreads/Winthreads
Large scale – MPI on cluster, tight coupling
Large scale – Grid computing, loose coupling
Pragmatic approach
C/C++ and Fortran are the kings of performance
(I’ve never written a single line of Fortran, so don’t ask)
“Bolted on” parallel concepts
Zero language support
Huge existing codebase
Pragmatic approach
Briefly touch on SIMD
Introduce SMP concepts: threads, concurrency
Development models: pthreads/WinThreads, OpenMP, TBB, ITK
Medical Image Processing: example problems, common errors
Next steps
SIMD
SIMD – basic principles (http://en.wikipedia.org/wiki/SIMD)
Data structures for SIMD
Array of Structures:

struct Vec {
  float x, y, z;
};
Vec* points = new Vec[sz];

[Figure: packing and unpacking x, y, z components into SIMD registers]
Data structures for SIMD
Structure of Arrays:

struct Vec {
  float *x, *y, *z;
  Vec ( int sz ) {
    x = new float[sz];
    y = new float[sz];
    z = new float[sz];
  }
};

Structure of Arrays, packed:

struct Vec {
  Vector4f* v;
  Vec ( int sz ) {
    // must be word aligned
    v = new Vector4f[sz];
  }
};
SIMD pitfalls
Structure alignment: usually needs to be aligned on a word boundary
Structure considerations: may need to refactor existing code/structures
Generally not cross-platform: MMX, 3DNow!, SSE, SSE2, SSE4, AltiVec, AVX, etc...
Performance gains are modest: 2x – 4x common
Limited instructions: add, multiply, divide, round; not suitable for branching logic
Autovectorizing compilers for simple loops: -ftree-vectorize (GCC), -fast, -O2 or -O3 (Intel Compiler)
Threads
Threads – they’re everywhere
SMP concepts
Useful to think in terms of “cores”: 2 dual-core CPUs = 4 “cores”
Cores share main memory, may share cache
Threads in the same process share memory
Generally, one executing thread per core; other threads sleeping
Cores – they’re everywhere
How many cores does your laptop have? Mine has 50(!)
  2 Intel CPU cores (Core 2 Duo)
  32 nVidia cores (9600M GT)
  16 nVidia cores (9400M)
Parallel concepts for SMP
Process
  Started by the OS
  Single thread executes “main”
  No direct access to memory of other processes
Threads
  Stream of execution under a process
  Access to memory in containing process
  Private memory
  Lifetime may be less than main thread
Concurrency
  Coordination between threads
  High level (mutex, locks, barriers)
  Low level (atomic operations)
Processes & Threads [Figure: process and thread diagram]
Thread construction – pthread example

#include <pthread.h>

// Thread work function, must return pointer to void
void *doWork(void *work) {
  // Do work
  return work; // equivalent to pthread_exit ( work );
}
...
pthread_t child;
...
rc = pthread_create(&child, &attr, doWork, (void *)work);
...
rc = pthread_join ( child, &threadwork );
...
Thread construction – Win32 example

#include <windows.h>

DWORD WINAPI doWork( LPVOID work ) {...};
...
PMYDATA work;
DWORD   childID;
HANDLE  child;
child = CreateThread(
        NULL,          // default security attributes
        0,             // use default stack size
        doWork,        // thread function name
        work,          // argument to thread function
        0,             // use default creation flags
        &childID);     // returns the thread identifier
WaitForSingleObject(child, INFINITE);
// (WaitForMultipleObjects takes an array of handles to join several threads)
Thread construction – Java example

import java.lang.Thread;

class Worker implements Runnable {
  public Worker ( Work work ) {};
  public void run() {}; // Do work here
}
...
Worker worker = new Worker ( someWork );
new Thread ( worker ).start();
Race Conditions [Figure: serial vs. parallel execution of the same code, showing the problem]
Mutex
Mutex – Mutual exclusion lock
  Protects a section of code
  Only one thread has a lock on the object
  Threads may wait for the mutex, or return a status if the mutex is locked
Semaphore
  N threads
Critical Section
  One thread executes code
Protects global resources, maintains consistent state
Race Conditions

N = 0;
// Start some threads
void* doWork() {
  N++; // get, incr, store – a race
}

Solution with Mutex:

Mutex mutex;
mutex.lock();
N++;
mutex.release();
Atomic operations
Locks are not perfect: cause blocking, relatively heavy-weight
Atomic operations
  Simple operations
  Hardware support
  Can implement w/Mutex
Conditions
  Invisibility – no other thread knows about the change until it completes
  Atomicity – if the operation fails, return to the original state
Deadlock [Figure: two threads, each holding one of Mutex A and Mutex B while waiting for the other]
Thread synchronization – barrier
Initialized with the number of threads expected
Threads signal when they are ready
Wait until all expected threads are there
A stalled or dead thread can stall all the threads
Thread synchronization – condition variables
Workers atomically release the mutex and wait
Master atomically releases the mutex and signals
Workers wake up and acquire the mutex
[Figure: workers waiting on a condition protected by Mutex A]
Thread pool & Futures
Maintains a “pool” of Worker threads
Work queued until a thread is available
Optionally notify through a “Future”
  Future can query status, holds return value
Thread returns to pool, no startup overhead
Core concept for OpenMP and TBB
OpenMP
Introduction to OpenMP
Scatter / gather paradigm
Maintains a thread pool
Requires compiler support: Visual C++, gcc 4.0, Intel Compiler
Easy to adapt existing serial code, easy to debug
Simple paradigm
OpenMP – simple parallel sections

#pragma omp parallel sections num_threads ( 5 )
{
  // 5 threads scatter here
  #pragma omp section
  { /* Do task 1 */ }
  #pragma omp section
  { /* Do task 2 */ }
  ...
  #pragma omp section
  { /* Do task N */ }
  // Implicit barrier
}
OpenMP – parallel for

#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  doSomeWork( i );
}
// Implicit barrier

The schedule clause controls how the iterations are divided among threads.
OpenMP – reduction

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i & TotalAmountOfWork
  TotalAmountOfWork += doSomeWork( i );
}
// Implicit barrier
// TotalAmountOfWork was properly accumulated
// Each thread has a local copy, the barrier does the reduction
// No need to use critical sections
OpenMP – “atomic” reduction

int TotalAmountOfWork = 0;
#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  int myWork = doSomeWork( i );
  #pragma omp atomic
  TotalAmountOfWork += myWork;
}
// Implicit barrier
// TotalAmountOfWork was properly accumulated
// However, the atomic section can cause thread stalls
OpenMP – critical

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork( i );
  #pragma omp critical
  {
    // Executed by one thread at a time, e.g., “Mutex lock”
    criticalOperation();
  }
}
// Implicit barrier
OpenMP – single

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork( i );
  #pragma omp single nowait
  {
    // Executed by one thread; use “master” for the main thread
    reportProgress ( TotalAmountOfWork );
  }
  // !! No implicit barrier because of the “nowait” clause !!
}
// Implicit barrier
Threading Building Blocks (TBB)
Introduction to TBB
Commercial and Open Source licenses (GPL with runtime exception)
Cross-platform C++ library, similar to STL
Usual concurrency classes
Several different constructs for threading: for, do, reduction, pipeline
Finer control over scheduling
Maintains a thread pool to execute tasks
http://www.threadingbuildingblocks.org/
TBB – parallel for

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

class Worker {
 public:
  Worker ( /* ... */ ) {...};
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int i = r.begin(); i != r.end(); ++i ) {
      doWork ( i );
    }
  }
};
...
tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
                    Worker ( /* ... */ ), tbb::auto_partitioner() );
TBB – parallel reduction

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"

class ReducingWorker {
  int mLocalWork;
 public:
  ReducingWorker ( /* ... */ ) {...};
  ReducingWorker ( ReducingWorker& o, tbb::split ) : mLocalWork(0) {};
  void join ( const ReducingWorker& o ) { mLocalWork += o.mLocalWork; };
  void operator() ( const tbb::blocked_range<int>& r ) { ... }
};
...
ReducingWorker w ( /* ... */ );
tbb::parallel_reduce ( tbb::blocked_range<int> ( 0, N ),
                       w, tbb::auto_partitioner() );
w.getLocalWork();
TBB – parallel reduction [Figure]
TBB – synchronization

tbb::spin_mutex MyMutex;
void doWork ( /* ... */ ) {
  // Enter critical section; exits when the lock goes out of scope
  tbb::spin_mutex::scoped_lock lock ( MyMutex );
  // NB: This is an error – an unnamed temporary releases the lock immediately:
  // tbb::spin_mutex::scoped_lock ( MyMutex );
}
...
#include <tbb/atomic.h>
tbb::atomic<int> MyCounter;
...
MyCounter = 0;       // Atomic
int i = MyCounter;   // Atomic
MyCounter++; MyCounter--; ++MyCounter; --MyCounter;  // Atomic
...
MyCounter = 0; MyCounter += 2;  // Watch out for other threads between statements!
ITK Model
ITK Implementation
Threads operate across slices
  Only implemented behavior in ITK
itk::MultiThreader is somewhat flexible
  Requires that you break the ITK model
Uses thread join, higher overhead
No thread pool
Comparison

Threads (C/C++)
  + Fine-grain control
  − Not cross-platform
  − Few constructs
Language specific (Java)
  + Fine-grain control
  + Cross-platform easy(?)
  + Many constructs
  +/− Language-specific
OpenMP
  + Simple
  + Adapt existing code
  +/− Industry standard
  +/− Compiler support
  − Coarse-grain control
TBB
  +/− More complex
  + Fine-grain control
  + Intel (−?)
  + Open Source
  + Some constructs
  − Must re-write code
ITK
  + Integrated
  + Simple
  − Limited control
  +/− ITK only
Medical Imaging
Image class

class Image {
 public:
  short* mData;
  int mWidth, mHeight, mDepth;
  int mVoxelsPerSlice;
  int mVoxelsPerVolume;
  short** mSlicePointers; // Pointers to the start of each slice
  short getVoxel ( int x, int y, int z ) {...}
  void setVoxel ( int x, int y, int z, short v ) {...}
};
Trivial problem – threshold
Threshold an image: if intensity > 100, output 1; otherwise output 0
Present from simple to complex: OpenMP, TBB, ITK, pthread (see extra slides)
Threshold – OpenMP #1

void doThreshold ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel(x,y,z) > 100 ) {
          out->setVoxel(x,y,z,1);
        } else {
          out->setVoxel(x,y,z,0);
        }
      }
    }
  }
}
// NB: can loop over slices, rows or columns by moving the
// pragma, but must choose at compile time
Threshold – OpenMP #2

void doThreshold ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
      out->mData[s] = 1;
    } else {
      out->mData[s] = 0;
    }
  }
}
// Likely a lot faster than the previous code
Threshold – TBB #1

class Threshold {
  Image *in, *out;
 public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int x = r.begin(); x != r.end(); ++x ) {
      if ( in->mData[x] > 100 ) {
        out->mData[x] = 1;
      } else {
        out->mData[x] = 0;
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range<int>( 0, in->mVoxelsPerVolume ),
               Threshold ( in, out ), auto_partitioner() );
// NB: default “grain size” for blocked_range is 1 pixel
// tbb::blocked_range<int>(..., in->mVoxelsPerVolume / NumberOfCPUs )
Threshold – TBB #2

class Threshold {
  Image *in, *out;
 public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {...}
  void operator() ( const tbb::blocked_range2d<int,int>& r ) const {
    for ( int z = 0; z < in->mDepth; z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          if ( in->getVoxel(x,y,z) > 100 ) {
            out->setVoxel(x,y,z,1);
          } else {
            out->setVoxel(x,y,z,0);
          }
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range2d<int,int>( 0, in->mHeight, 32,
                                              0, in->mWidth,  32 ),
               Threshold ( in, out ), auto_partitioner() );
Threshold – TBB #3

class Threshold {
  Image *in, *out;
 public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {...}
  void operator() ( const tbb::blocked_range2d<int,int>& r ) const {...}
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          if ( in->getVoxel(x,y,z) > 100 ) {
            out->setVoxel(x,y,z,1);
          } else {
            out->setVoxel(x,y,z,0);
          }
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,  1,
                                                  0, in->mHeight, 32,
                                                  0, in->mWidth,  32 ),
               Threshold ( in, out ), auto_partitioner() );
Threshold – ITK solution

ThreadedGenerateData( const OutputImageRegionType out, int threadId )
{
  ...
  // Define the iterators
  ImageRegionConstIterator<TIn> inputIt(inputPtr, out);
  ImageRegionIterator<TOut> outputIt(outputPtr, out);
  inputIt.GoToBegin();
  outputIt.GoToBegin();
  while( !inputIt.IsAtEnd() )
    {
    if ( inputIt.Get() > 100 ) {
      outputIt.Set ( 1 );
    } else {
      outputIt.Set ( 0 );
    }
    ++inputIt;
    ++outputIt;
    }
}
Interesting problem – anisotropic diffusion
Edge preserving smoothing method
  Perona and Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (1990) vol. 12 (7) pp. 629–639
Iterative process
Demonstrate OpenMP and TBB
  (ITK has an implementation)
  (pthreads are tedious at the very least)
Pop quiz – are the following correct?
Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int t = 0; t < TotalTime; t++ ) {
    for ( int z = 0; z < in->mDepth; z++ ) {
      ...
    }
  }
}
// Quiz: incorrect – the time steps depend on each other, so the
// outer t loop must not be parallelized
Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
  short *previousSlice, *slice, *nextSlice;
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      previousSlice = in->mSlicePointers[z-1];
      slice = in->mSlicePointers[z];
      nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
// Quiz: incorrect – previousSlice, slice and nextSlice are declared
// outside the parallel for, so they are shared and race between threads
Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      short* previousSlice = in->mSlicePointers[z-1];
      short* slice = in->mSlicePointers[z];
      short* nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
// Quiz: correct – the slice and row pointers are declared inside the
// parallel loop, so each thread has its own private copies
Anisotropic diffusion – TBB #1

class doAD {
 public:
  static ADConstants* sConstants;
  doAD ( Image* in, Image* out ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    if ( sConstants == NULL ) { initConstants(); }
    // process
    ...
  }
};
// Quiz: incorrect – lazy initialization of the shared static sConstants
// is unsynchronized, so two threads can race in initConstants
Anisotropic diffusion – TBB #2

class doAD {
 public:
  doAD ( ... ) {...}
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,
                                                  0, in->mHeight,
                                                  0, in->mWidth ),
               doAD ( in, out ), auto_partitioner() );
Anisotropic diffusion – TBB #3

class doAD {
 public:
  static tbb::atomic<int> sProgress;
  tbb::spin_mutex mMutex;
  doAD ( ... ) {...}
  void reportProgress ( int p ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      tbb::spin_mutex::scoped_lock lock ( mMutex );
      sProgress++;
      reportProgress ( sProgress );
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
  }
};
...
doAD::sProgress = 0;
parallel_for (...);
// Quiz: incorrect – mMutex is a non-static member and parallel_for
// copies the body object, so each copy locks its own mutex
Anisotropic diffusion – TBB #4

class doAD {
 public:
  static tbb::atomic<int> sProgress;
  static tbb::spin_mutex mMutex;
  doAD ( ... ) {...}
  void reportProgress ( int p ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      tbb::spin_mutex::scoped_lock lock ( mMutex );
      sProgress++;
      reportProgress ( sProgress );
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
  }
};
...
doAD::sProgress = 0;
parallel_for (...);
// Quiz: correct – sProgress and mMutex are static, so all copies of
// the body share the same counter and mutex
Anisotropic diffusion – OpenMP (progress)

void doAD ( Image* in, Image* out ) {
  int progress = 0;
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int s = 0; s < in->mDepth; s++ ) {
      #pragma omp atomic
      progress++;
      #pragma omp single nowait
      reportProgress ( progress );
      ...
    }
  }
}
// Quiz: incorrect – a “single” work-sharing construct may not be
// nested inside a parallel for loop
Real-life problem
Compute Frangi’s vesselness measure
  Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946–956
Memory constrained solution
  ITK implementation requires 1.2G for a 100M volume
  Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007)
Possible solutions using OpenMP, TBB
Vesselness
ITK Implementation – computing the Hessian
6 volumes computed in serial
Individual filters are threaded
Good CPU usage
High memory requirements
Design considerations
Break the problem into blocks
Compute Hessian, eigenvalues, and vesselness per block
Reduces memory requirements
Incurs overhead, boundary conditions
Design considerations [Figure: keep CPUs full]
Design considerations – boundary condition
Trade-offs
Algorithm sketch – Serial

int BlockSize = 32;
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
Algorithm sketch – OpenMP

int BlockSize = 32;
#pragma omp parallel for
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
// Each thread is on a different slice
// May cause cache contention
// Similar problems for the “y” direction
Algorithm sketch – OpenMP

int BlockSize = 32;
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
#pragma omp parallel for
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
// All threads on the same rows
// May not utilize all CPUs if the ratio of Width to BlockSize < # CPUs
// Better cache utilization
Algorithm sketch – TBB

class Vesselness {
 public:
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
    // Process the block, could use ITK here
    processBlock ( r.cols().begin(), r.rows().begin(), r.pages().begin(),
                   r.cols().size(),  r.rows().size(),  r.pages().size() );
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>(
                 0, in->mDepth,  32,
                 0, in->mHeight, 32,
                 0, in->mWidth,  32 ),
               Vesselness ( in, out ), auto_partitioner() );
// Individual blocks keep the CPUs full
// May not have the best cache performance
Next steps
Go try parallel development
  Try threads to gain understanding and insight
  Next OpenMP, adapting existing code
  TBB: more constructs, different approaches
Experiment with new languages: Erlang, Scala, Reia, Chapel, X10, Fortress...
Check out some of the resources provided
Have fun!  It’s a brave new world out there...
Resources
TBB (http://www.threadingbuildingblocks.org/)
OpenMP (http://openmp.org/wp/)
Books/Articles
  Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/)
  Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/)
  ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf)
  The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)
Tutorials
  Parallel Programming (https://computing.llnl.gov/tutorials/parallel_comp/)
  pthreads (https://computing.llnl.gov/tutorials/pthreads/)
  OpenMP (https://computing.llnl.gov/tutorials/openMP/)
Other
  LLNL (https://computing.llnl.gov/)
  Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language)
  GCC-OpenMP (http://gcc.gnu.org/projects/gomp/)
  Intel Compiler (http://software.intel.com/en-us/intel-compilers/)
Resources
Languages
  Erlang (http://www.erlang.org/)
  Scala (http://www.scala-lang.org/)
  Chapel (http://chapel.cs.washington.edu/)
  X10 (http://x10-lang.org/)
  Unified Parallel C (http://upc.gwu.edu/)
  Titanium (http://titanium.cs.berkeley.edu/)
  Co-Array Fortran (http://www.co-array.org/)
  ZPL (http://www.cs.washington.edu/research/zpl/home/index.html)
  High Performance Fortran (http://hpff.rice.edu/)
  Fortress (http://projectfortress.sun.com/Projects/Community/)
  Others (http://www.google.com/search?q=parallel+programming+language)
Thread construction – pthread example

#include <pthread.h>

void *(*start_routine)(void *);

int pthread_create(pthread_t *restrict thread,
                   const pthread_attr_t *restrict attr,
                   void *(*start_routine)(void *),
                   void *restrict arg);

void pthread_exit(void *value_ptr);

int pthread_join(pthread_t thread, void **value_ptr);
Mutex – pthread example

#include <pthread.h>

pthread_mutex_t myMutex;
...
pthread_mutex_init ( &myMutex, NULL );
...
pthread_mutex_lock ( &myMutex );
// Critical section, only one thread at a time
...
pthread_mutex_unlock ( &myMutex );
...
if ( pthread_mutex_trylock ( &myMutex ) == 0 ) {
  // We did get the lock, so we are in the critical section
  ...
  pthread_mutex_unlock ( &myMutex );
}
Mutex – Java example

import java.lang.*;

class Foo {
  public synchronized int doWork () {
    // only one thread can execute doWork
  }

  Object resource;
  public int otherWork () {
    synchronized ( resource ) {
      // critical section, resource is the mutex
      ...
    }
  }
}

Medical Image Processing Strategies for multi-core CPUs

  • 1.
    Medical image processingstrategies for multi-core CPUsDaniel Blezek, Mayo Clinicblezek.daniel@mayo.edu
  • 2.
    PollDoes your primarycomputer have more than one core...?2Have you ever written parallel code?
  • 3.
    It’s a parallelworld...SMP formerly was the domain of researchersThanks to Intel, now it’s everywhere!3... but most of us think in serial ...Hardware has far outstripped software
  • 4.
  • 5.
    Development of parallelsoftware is difficult
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
    Parallel Computing –according to Google“parallel computing” 1.4M hits on Google“multithreading” 10M hits“multicore” 2.4M hits“parallel programming” 1.1M hitsWhy is it so hard?the world is parallelwe all think in parallelyet we are taught to program in serial4driving
  • 11.
    Degrees of parallelism(my take)Serial – SISD single thread of executionData parallel – SIMD (fine grained parallelism)Embarrassingly parallel – larger scale SIMDCT or MR reconstructionEach operation is independent, e.g. iFFT of slicesWorker thread – e.g. virus scanning softwareCoarse grained parallelism – SMP or MIMDFocus of this presentation, more in GPU talkConcurrency, OpenMP, TBB, pthreads/WinthreadsLarge scale – MPI on cluster, tight couplingLarge scale – Grid computing, loose coupling5
  • 12.
    Pragmatic approachC/C++ andFortran are the kings of performance(I’ve never written a single line of Fortran, so don’t ask)“Bolted on” parallel conceptsZero language supportHuge existing codebase6
  • 13.
    Pragmatic approachBriefly touchon SIMDIntroduce SMP conceptsThreads, concurrencyDevelopment modelspthreads/WinThreadsOpenMPTBBITKMedical Image ProcessingExample problemsCommon errorsNext steps7packed
  • 14.
  • 15.
    SIMD – basicprinciples9http://en.wikipedia.org/wiki/SIMD
  • 16.
    Data structures forSIMDArray of Structuresstruct Vec { float x, y, z;};Vec[] points = new Vec[sz];10XYZ--PackXYZ--XYZ--*UnpackXYZ--
  • 17.
    Data structures forSIMD 11Structure of Arraysstruct Vec { float[] x; float[] y; float[] z; Vec ( int sz ) { x = new float[sz]; y = new float[sz]; z = new float[sz]; };};Structure of Arraysstruct Vec{ Vector4f[] v; Vec ( int sz ) { // must be word // aligned v = new Vector4f[sz]; };};
  • 18.
    SIMD pitfallsStructure alignmentUsuallyneeds to be aligned on word boundaryStructure considerationsMay need to refactor existing code/structuresGenerally not cross-platformMMX, 3D Now!, SSE, SSE2, SSE4, AltVec, AVX, etc...Performance gains are modest2x – 4x commonLimited instructionsAdd, multiply, divide, roundNot suitable for branching logicAutovectorizing compilers for simple loops-ftree-vectorize (GCC), -fast, -O2 or -O3 (Intel Compiler)12
  • 19.
  • 20.
  • 21.
    SMP concepts15Useful tothink in terms of “cores”2 dual-core CPU = 4 “cores”Cores share main memory, may share cacheThreads in same process share memoryGenerally, one executing thread per coreOther threads sleeping
  • 22.
    Cores – they’reeverywhere16How many cores does your laptop have?Mine has 50(!) 2 Intel CPU (Core 2 Duo) 32 nVidia cores (9600M GT) 16 nVidia cores (9400M)
  • 23.
    Parallel concepts forSMPProcessStarted by the OSSingle thread executes “main”No direct access to memory of other processesThreadsStream of execution under a processAccess to memory in containing processPrivate memoryLifetime may be less than main threadConcurrencyCoordination between threadsHigh level (mutex, locks, barriers)Low level (atomic operations)17
  • 24.
  • 25.
    #include <pthread.h>// Threadwork function, must return pointer to voidvoid *doWork(void *work) { // Do work return work; // equivalent to pthread_exit ( myWork );}...pthread_t child;...rc=pthread_create(&child, &attr, doWork, (void *)work);...rc = pthread_join ( child, &threadwork );...Thread construction – pthread example19
  • 26.
    Thread construction –Win32 example20#include <windows.h>DWORD WINAPI doWork( LPVOID work) {};...PMYDATA work;DWORD childID;HANDLE child;child = CreateThread( NULL, // default security attributes 0, // use default stack size doWork, // thread function name work, // argument to thread function 0, // use default creation flags &childID); // returns the thread identifierWaitForMultipleObjects(NThreads, child, TRUE, INFINITE);
  • 27.
    Thread construction –Java example21import java.lang.Thread;class Worker implements Runnable { public Worker ( Work work ) {}; public void run() {}; // Do work here}...Worker worker = new Worker ( someWork );New Thread ( worker ).start();
  • 28.
  • 29.
    MutexMutex – Mutualexclusion lockProtects a section of codeOnly one thread has a lock on the objectThreads maywait for the mutexreturn a status if the mutex is lockedSemaphoreN threadsCritical SectionOne thread executes codeProtects global resourcesMaintain consistent state23
  • 30.
    Race Conditions24...N =0;...// Start some threads...void* doWork() { N++; // get, incr, store}Mutexmutex;mutex.lock();mutex.release();Solution w/MutexNoNo
  • 31.
    Atomic operationsLocks arenot perfectCause blockingRelatively heavy-weightAtomic operationsSimple operationsHardware supportCan implement w/MutexConditionsInvisibility – no other thread knows about the changeAtomicity – if operation fails, return to original state25
  • 32.
  • 33.
    Thread synchronization –barrierInitialized with the number of threads expectedThreads signal when they are readyWait until all expected threads are thereA stalled or dead thread can stall all the threads27
  • 34.
    Thread synchronization –Condition variablesWorkers atomically release mutex and waitMaster atomically releases mutex and signalsWorkers wake up and acquire mutex28Mutex AWorkingConditionMutex AConditionMutex AWaitMutex AConditionMutex AMutexThread
  • 35.
    Thread pool &Futures29Maintains a “pool” of Worker threadsWork queued until thread availableOptionally notify through a “Future”Future can query status, holds return valueThread returns to pool, no startup overheadCore concept for OpenMP and TBB
  • 36.
  • 37.
    Introduction to OpenMPScatter/ gather paradigmMaintains a thread poolRequires compiler supportVisual C++, gcc 4.0, Intel CompilerEasy to adapt existing serial code, easy to debugSimple paradigm31
  • 38.
    OpenMP – simpleparallel sections32#pragmaomp parallel sections num_threads ( 5 ){ // 5 Threads scatter here #pragmaomp section { // Do task 1 } #pragmaomp section { // Do task 2 } ... #pragmaomp section { // Do task N } // Implicit barrier}Barrier...NoNo
  • 39.
    OpenMP – parallelfor33#pragmaomp parallel forfor ( int i = 0; i < NumberOfIterations; i++ ) { // Threads scatter here // each thread has a private copy of idoSomeWork( i );}// Implicit barrierScheduling the iterations
  • 40.
    OpenMP – reduction34intTotalAmountOfWork = 0;#pragmaomp parallel for reduction ( + : TotalAmountOfWork )for ( int i = 0; i < NumberOfIterations; i++ ) { // Threads scatter here // each thread has a private copy of i & TotalAmountOfWorkTotalAmountOfWork += doSomeWork( i );}// Implicit barrier// TotalAmountOfWork was properly accumulated// Each thread has local copy, barrier does reduction// No need to use critical sections
  • 41.
OpenMP – “atomic” reduction

int TotalAmountOfWork = 0;
#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  int myWork = doSomeWork ( i );
  #pragma omp atomic
  TotalAmountOfWork += myWork;
}
// Implicit barrier
// TotalAmountOfWork was properly accumulated
// However, the atomic section can cause thread stalls
OpenMP – critical

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork ( i );
  #pragma omp critical
  {
    // Executed by one thread at a time, e.g., a “mutex lock”
    criticalOperation();
  }
}
// Implicit barrier
OpenMP – single

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork ( i );
  #pragma omp single nowait
  {
    // Executed by one thread; use “master” for the main thread
    reportProgress ( TotalAmountOfWork );
  }
  // !! No implicit barrier because of the “nowait” clause !!
}
// Implicit barrier
Introduction to TBB
Commercial and open source licenses: GPL with runtime exception
Cross-platform C++ library, similar to STL
Usual concurrency classes
Several different constructs for threading: for, do, reduction, pipeline
Finer control over scheduling
Maintains a thread pool to execute tasks
http://www.threadingbuildingblocks.org/
TBB – parallel for

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

class Worker {
public:
  Worker ( /* ... */ ) {...};
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int i = r.begin(); i != r.end(); ++i ) {
      doWork ( i );
    }
  }
};
...
tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
                    Worker ( /* ... */ ),
                    tbb::auto_partitioner() );
TBB – parallel reduction

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"

class ReducingWorker {
  int mLocalWork;
public:
  ReducingWorker ( /* ... */ ) {...};
  // Splitting constructor: gives each new thread a fresh local accumulator
  ReducingWorker ( ReducingWorker& o, tbb::split ) : mLocalWork ( 0 ) {};
  void join ( const ReducingWorker& o ) { mLocalWork += o.mLocalWork; };
  void operator() ( const tbb::blocked_range<int>& r ) { ... }
};
...
ReducingWorker w ( /* ... */ );
tbb::parallel_reduce ( tbb::blocked_range<int> ( 0, N ),
                       w,
                       tbb::auto_partitioner() );
w.getLocalWork();
TBB – parallel reduction (diagram)
TBB – synchronization

tbb::spin_mutex MyMutex;
void doWork ( /* ... */ ) {
  // Enter critical section; exit when the lock goes out of scope
  tbb::spin_mutex::scoped_lock lock ( MyMutex );
  // NB: This is an error!!! The unnamed temporary locks and then
  // immediately unlocks the mutex:
  // tbb::spin_mutex::scoped_lock ( MyMutex );
}
...
#include <tbb/atomic.h>
tbb::atomic<int> MyCounter;
...
MyCounter = 0;                  // Atomic
int i = MyCounter;              // Atomic
MyCounter++; MyCounter--; ++MyCounter; --MyCounter; // Atomic
...
MyCounter = 0; MyCounter += 2;  // Watch out for other threads!
ITK implementation
Threads operate across slices
Only implemented behavior in ITK
itk::MultiThreader is somewhat flexible
Requires that you break the ITK model
Uses thread join, which has higher overhead
No thread pool
Comparison
Language specific (Java): + fine-grain control, + cross-platform easy(?), + many constructs, +/- language-specific
Threads (C/C++): + fine-grain control, - not cross-platform, - few constructs
ITK: + integrated, + simple, - limited control, +/- ITK only
TBB: +/- more complex, + fine-grain control, + Intel (-?), + open source, + some constructs, - must re-write code
OpenMP: + simple, + adapt existing code, +/- industry standard, +/- compiler support, - coarse-grain control
Image class

class Image {
public:
  short* mData;
  int mWidth, mHeight, mDepth;
  int mVoxelsPerSlice;
  int mVoxelsPerVolume;
  short** mSlicePointers; // Pointers to the start of each slice
  short getVoxel ( int x, int y, int z ) {...}
  void setVoxel ( int x, int y, int z, short v ) {...}
};
Trivial problem – threshold
Threshold an image: if intensity > 100, output 1; otherwise output 0
Presented from simple to complex: OpenMP, TBB, ITK, pthreads (see extra slides)
Threshold – OpenMP #1

void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel ( x, y, z ) > 100 ) {
          out->setVoxel ( x, y, z, 1 );
        } else {
          out->setVoxel ( x, y, z, 0 );
        }
      }
    }
  }
}
// NB: can loop over slices, rows, or columns by moving the
// pragma, but must choose at compile time
Threshold – OpenMP #2

void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
      out->mData[s] = 1;
    } else {
      out->mData[s] = 0;
    }
  }
}
// Likely a lot faster than the previous code
Threshold – TBB #1

class Threshold {
public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int x = r.begin(); x != r.end(); ++x ) {
      if ( in->mData[x] > 100 ) {
        out->mData[x] = 1;
      } else {
        out->mData[x] = 0;
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range<int> ( 0, in->mVoxelsPerVolume ),
               Threshold ( in, out ), auto_partitioner() );
// NB: default “grain size” for blocked_range is 1 pixel
// tbb::blocked_range<int> ( ..., in->mVoxelsPerVolume / NumberOfCPUs )
Threshold – TBB #2

class Threshold {
public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {...}
  void operator() ( const tbb::blocked_range2d<int,int>& r ) const {
    for ( int z = 0; z < in->mDepth; z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          if ( in->getVoxel ( x, y, z ) > 100 ) {
            out->setVoxel ( x, y, z, 1 );
          } else {
            out->setVoxel ( x, y, z, 0 );
          }
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range2d<int,int> ( 0, in->mHeight, 32,
                                               0, in->mWidth, 32 ),
               Threshold ( in, out ), auto_partitioner() );
Threshold – TBB #3

class Threshold {
public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {...}
  void operator() ( const tbb::blocked_range2d<int,int>& r ) const {...}
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          if ( in->getVoxel ( x, y, z ) > 100 ) {
            out->setVoxel ( x, y, z, 1 );
          } else {
            out->setVoxel ( x, y, z, 0 );
          }
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int> ( 0, in->mDepth, 1,
                                                   0, in->mHeight, 32,
                                                   0, in->mWidth, 32 ),
               Threshold ( in, out ), auto_partitioner() );
Threshold – ITK solution

ThreadedGenerateData ( const OutputImageRegionType out, int threadId )
{
  ...
  // Define the iterators
  ImageRegionConstIterator<TIn> inputIt ( inputPtr, out );
  ImageRegionIterator<TOut> outputIt ( outputPtr, out );
  inputIt.GoToBegin();
  outputIt.GoToBegin();
  while ( !inputIt.IsAtEnd() ) {
    if ( inputIt.Get() > 100 ) {
      outputIt.Set ( 1 );
    } else {
      outputIt.Set ( 0 );
    }
    ++inputIt;
    ++outputIt;
  }
}
Interesting problem – anisotropic diffusion
Edge preserving smoothing method
Perona and Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (1990) vol. 12 (7) pp. 629-639
Iterative process
Demonstrate OpenMP and TBB
(ITK has an implementation)
(pthreads are tedious at the very least)
Pop quiz – are the following correct?
Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int t = 0; t < TotalTime; t++ ) {
    for ( int z = 0; z < in->mDepth; z++ ) {
      ...
    }
  }
}
Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
  short *previousSlice, *slice, *nextSlice;
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      previousSlice = in->mSlicePointers[z-1];
      slice = in->mSlicePointers[z];
      nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      short* previousSlice = in->mSlicePointers[z-1];
      short* slice = in->mSlicePointers[z];
      short* nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
Anisotropic diffusion – TBB #1

class doAD {
public:
  static ADConstants* sConstants;
  doAD ( Image* in, Image* out ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
    if ( sConstants == NULL ) { initConstants(); }
    // process
    ...
  }
};
Anisotropic diffusion – TBB #2

class doAD {
public:
  doAD ( ... ) {...}
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int> ( 0, in->mDepth,
                                                   0, in->mHeight,
                                                   0, in->mWidth ),
               doAD ( in, out ), auto_partitioner() );
Anisotropic diffusion – TBB #3

class doAD {
public:
  static tbb::atomic<int> sProgress;
  tbb::spin_mutex mMutex;
  doAD ( ... ) {...}
  void reportProgress ( int p ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      tbb::spin_mutex::scoped_lock lock ( mMutex );
      sProgress++;
      reportProgress ( sProgress );
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
doAD::sProgress = 0;
parallel_for ( ... );
Anisotropic diffusion – TBB #4

class doAD {
public:
  static tbb::atomic<int> sProgress;
  static tbb::spin_mutex mMutex;
  doAD ( ... ) {...}
  void reportProgress ( int p ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      tbb::spin_mutex::scoped_lock lock ( mMutex );
      sProgress++;
      reportProgress ( sProgress );
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
doAD::sProgress = 0;
parallel_for ( ... );
Anisotropic diffusion – OpenMP (progress)

void doAD ( Image* in, Image* out ) {
  int progress = 0;
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int s = 0; s < in->mDepth; s++ ) {
      #pragma omp atomic
      progress++;
      #pragma omp single nowait
      reportProgress ( progress );
      ...
    }
  }
}
Real-life problem
Compute Frangi’s vesselness measure
Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946-956
Memory constrained solution
ITK implementation requires 1.2G for a 100M volume
Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007)
Possible solutions using OpenMP, TBB
ITK implementation – computing the Hessian
6 volumes computed in serial
Individual filters are threaded
Good CPU usage
High memory requirements
Design considerations
Break problem into blocks
Compute Hessian, eigenvalues, and vesselness per block
Reduces memory requirements
Incurs overhead, boundary conditions
Design considerations – boundary condition (diagram)
Algorithm sketch – serial

int BlockSize = 32;
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
Algorithm sketch – OpenMP

int BlockSize = 32;
#pragma omp parallel for
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}

Each thread is on a different slice
May cause cache contention
Similar problems for the “y” direction
Algorithm sketch – OpenMP

int BlockSize = 32;
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    #pragma omp parallel for
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}

All threads on the same rows
May not utilize all CPUs if the ratio of Width to BlockSize < # CPUs
Better cache utilization
Algorithm sketch – TBB

class Vesselness {
public:
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
    // Process the block; could use ITK here
    processBlock ( r.cols().begin(), r.rows().begin(), r.pages().begin(),
                   r.cols().size(), r.rows().size(), r.pages().size() );
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int> ( 0, in->mDepth, 32,
                                                   0, in->mHeight, 32,
                                                   0, in->mWidth, 32 ),
               Vesselness ( in, out ), auto_partitioner() );

Individual blocks
Uses full CPUs
May not have the best cache performance
Next steps
Go try parallel development
Try threads to gain understanding and insight
Next, OpenMP: adapt existing code
TBB: more constructs, a different approach
Experiment with new languages: Erlang, Scala, Reia, Chapel, X10, Fortress...
Check out some of the resources provided
Have fun! It’s a brave new world out there...
Resources
TBB (http://www.threadingbuildingblocks.org/)
OpenMP (http://openmp.org/wp/)
Books/Articles
Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/)
Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/)
ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf)
The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)
Tutorials
Parallel Programming (https://computing.llnl.gov/tutorials/parallel_comp/)
pthreads (https://computing.llnl.gov/tutorials/pthreads/)
OpenMP (https://computing.llnl.gov/tutorials/openMP/)
Other
LLNL (https://computing.llnl.gov/)
Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language)
GCC-OpenMP (http://gcc.gnu.org/projects/gomp/)
Intel Compiler (http://software.intel.com/en-us/intel-compilers/)
Resources – languages
Erlang (http://www.erlang.org/)
Scala (http://www.scala-lang.org/)
Chapel (http://chapel.cs.washington.edu/)
X10 (http://x10-lang.org/)
Unified Parallel C (http://upc.gwu.edu/)
Titanium (http://titanium.cs.berkeley.edu/)
Co-Array Fortran (http://www.co-array.org/)
ZPL (http://www.cs.washington.edu/research/zpl/home/index.html)
High Performance Fortran (http://hpff.rice.edu/)
Fortress (http://projectfortress.sun.com/Projects/Community/)
Others (http://www.google.com/search?q=parallel+programming+language)
Medical image processing strategies for multi-core CPUs
Daniel Blezek, Mayo Clinic
blezek.daniel@mayo.edu
Thread construction – pthread example

#include <pthread.h>

void *(*start_routine)(void *);

int pthread_create ( pthread_t *restrict thread,
                     const pthread_attr_t *restrict attr,
                     void *(*start_routine)(void *),
                     void *restrict arg );
void pthread_exit ( void *value_ptr );
int pthread_join ( pthread_t thread, void **value_ptr );
Mutex – pthread example

#include <pthread.h>
pthread_mutex_t myMutex;
...
pthread_mutex_init ( &myMutex, NULL );
...
pthread_mutex_lock ( &myMutex );
// Critical section, only one thread at a time
...
pthread_mutex_unlock ( &myMutex );
...
// trylock returns 0 on success, EBUSY if the mutex is already locked
if ( pthread_mutex_trylock ( &myMutex ) == 0 ) {
  // We did get the lock, so we are in the critical section
  ...
  pthread_mutex_unlock ( &myMutex );
}
Mutex – Java example

class Foo {
  public synchronized int doWork () {
    // only one thread can execute doWork
  }

  Object resource;
  public int otherWork () {
    synchronized ( resource ) {
      // critical section; resource is the mutex
      ...
    }
  }
}

Editor's Notes

  • #3 If I had asked this question 5 years ago, almost no one would have raised their hand.
  • #5 Driving is inherently a parallel task, we coordinate at stop signs, stop lights, we obey the rules of the road, but we can get deadlocked (grid lock).