Medical Image Processing Strategies for multi-core CPUs

  • If I had asked this question 5 years ago, almost no one would have raised their hand.
  • Driving is inherently a parallel task: we coordinate at stop signs and stop lights and obey the rules of the road, but we can still get deadlocked (gridlock).

    1. Medical image processing strategies for multi-core CPUs
       Daniel Blezek, Mayo Clinic
       blezek.daniel@mayo.edu

    2. Poll
       Does your primary computer have more than one core?
       Have you ever written parallel code?
    3. It’s a parallel world...
       SMP formerly was the domain of researchers
       Thanks to Intel, now it’s everywhere!
       ... but most of us think in serial ...
         Hardware has far outstripped software
         Developers are not trained
         Development of parallel software is difficult
         Outside the box: Erlang, Scala, ...
    10. Parallel Computing – according to Google
        “parallel computing” 1.4M hits on Google
        “multithreading” 10M hits
        “multicore” 2.4M hits
        “parallel programming” 1.1M hits
        Why is it so hard?
          the world is parallel
          we all think in parallel
          yet we are taught to program in serial
    11. Degrees of parallelism (my take)
        Serial – SISD, a single thread of execution
        Data parallel – SIMD (fine-grained parallelism)
        Embarrassingly parallel – larger-scale SIMD
          CT or MR reconstruction
          Each operation is independent, e.g. iFFT of slices
        Worker thread – e.g. virus-scanning software
        Coarse-grained parallelism – SMP or MIMD
          Focus of this presentation; more in the GPU talk
          Concurrency, OpenMP, TBB, pthreads/WinThreads
        Large scale – MPI on a cluster, tight coupling
        Large scale – Grid computing, loose coupling
    12. Pragmatic approach
        C/C++ and Fortran are the kings of performance
        (I’ve never written a single line of Fortran, so don’t ask)
        “Bolted on” parallel concepts
        Zero language support
        Huge existing codebase

    13. Pragmatic approach
        Briefly touch on SIMD
        Introduce SMP concepts: threads, concurrency
        Development models: pthreads/WinThreads, OpenMP, TBB, ITK
        Medical image processing: example problems, common errors
        Next steps
    14. SIMD

    15. SIMD – basic principles
        http://en.wikipedia.org/wiki/SIMD
    16. Data structures for SIMD
        Array of Structures:
          struct Vec {
            float x, y, z;
          };
          Vec* points = new Vec[sz];
        Values are packed into SIMD registers (X Y Z –), operated on, then unpacked.

    17. Data structures for SIMD
        Structure of Arrays:
          struct Vec {
            float *x, *y, *z;
            Vec ( int sz ) {
              x = new float[sz];
              y = new float[sz];
              z = new float[sz];
            }
          };
        Structure of Arrays (packed):
          struct Vec {
            Vector4f* v;
            Vec ( int sz ) {
              // must be word aligned
              v = new Vector4f[sz];
            }
          };
    18. SIMD pitfalls
        Structure alignment: usually needs to be aligned on a word boundary
        Structure considerations: may need to refactor existing code/structures
        Generally not cross-platform: MMX, 3DNow!, SSE, SSE2, SSE4, AltiVec, AVX, etc.
        Performance gains are modest: 2x – 4x common
        Limited instructions: add, multiply, divide, round
          Not suitable for branching logic
        Autovectorizing compilers for simple loops
          -ftree-vectorize (GCC); -fast, -O2 or -O3 (Intel Compiler)
    19. Threads

    20. Threads – they’re everywhere

    21. SMP concepts
        Useful to think in terms of “cores”
        2 dual-core CPUs = 4 “cores”
        Cores share main memory, may share cache
        Threads in the same process share memory
        Generally, one executing thread per core; other threads sleeping

    22. Cores – they’re everywhere
        How many cores does your laptop have? Mine has 50(!)
          2 Intel cores (Core 2 Duo)
          32 nVidia cores (9600M GT)
          16 nVidia cores (9400M)

    23. Parallel concepts for SMP
        Process
          Started by the OS
          A single thread executes “main”
          No direct access to the memory of other processes
        Threads
          Stream of execution under a process
          Access to memory in the containing process
          Private memory
          Lifetime may be less than the main thread’s
        Concurrency
          Coordination between threads
          High level (mutex, locks, barriers)
          Low level (atomic operations)

    24. Processes & Threads
    25. Thread construction – pthread example
          #include <pthread.h>

          // Thread work function, must return pointer to void
          void *doWork(void *work) {
            // Do work
            return work; // equivalent to pthread_exit ( work );
          }
          ...
          pthread_t child;
          ...
          rc = pthread_create(&child, &attr, doWork, (void *)work);
          ...
          rc = pthread_join ( child, &threadwork );
          ...
    26. Thread construction – Win32 example
          #include <windows.h>

          DWORD WINAPI doWork( LPVOID work ) { ... }
          ...
          PMYDATA work;
          DWORD childID;
          HANDLE child;
          child = CreateThread(
                    NULL,      // default security attributes
                    0,         // use default stack size
                    doWork,    // thread function name
                    work,      // argument to thread function
                    0,         // use default creation flags
                    &childID); // returns the thread identifier
          // Wait for the child (use WaitForMultipleObjects for an array of handles)
          WaitForSingleObject( child, INFINITE );
    27. Thread construction – Java example
          import java.lang.Thread;

          class Worker implements Runnable {
            public Worker ( Work work ) {}
            public void run() {} // Do work here
          }
          ...
          Worker worker = new Worker ( someWork );
          new Thread ( worker ).start();
    28. Race Conditions
        Serial vs. parallel execution – problem!

    29. Mutex
        Mutex – mutual exclusion lock
          Protects a section of code
          Only one thread has a lock on the object
          Threads may wait for the mutex, or return a status if the mutex is locked
        Semaphore
          N threads
        Critical Section
          One thread executes the code
          Protects global resources
          Maintains consistent state

    30. Race Conditions
          ...
          N = 0;
          ...
          // Start some threads
          ...
          void* doWork() {
            N++; // get, incr, store
          }
        Solution with a Mutex:
          Mutex mutex;
          mutex.lock();
          mutex.release();

    31. Atomic operations
        Locks are not perfect
          Cause blocking
          Relatively heavy-weight
        Atomic operations
          Simple operations
          Hardware support
          Can implement w/Mutex
        Conditions
          Invisibility – no other thread knows about the change
          Atomicity – if the operation fails, return to the original state
    32. Deadlock
        Thread 1 holds Mutex A and waits for Mutex B; Thread 2 holds Mutex B and waits for Mutex A

    33. Thread synchronization – barrier
        Initialized with the number of threads expected
        Threads signal when they are ready
        Wait until all expected threads are there
        A stalled or dead thread can stall all the threads

    34. Thread synchronization – condition variables
        Workers atomically release the mutex and wait
        Master atomically releases the mutex and signals
        Workers wake up and acquire the mutex

    35. Thread pool & Futures
        Maintains a “pool” of worker threads
        Work is queued until a thread is available
        Optionally notify through a “Future”
        A Future can query status and holds the return value
        Thread returns to the pool, no startup overhead
        Core concept for OpenMP and TBB
    36. OpenMP

    37. Introduction to OpenMP
        Scatter/gather paradigm
        Maintains a thread pool
        Requires compiler support: Visual C++, gcc 4.0, Intel Compiler
        Easy to adapt existing serial code, easy to debug
        Simple paradigm

    38. OpenMP – simple parallel sections
          #pragma omp parallel sections num_threads ( 5 )
          {
            // 5 threads scatter here
            #pragma omp section
            { /* Do task 1 */ }
            #pragma omp section
            { /* Do task 2 */ }
            ...
            #pragma omp section
            { /* Do task N */ }
            // Implicit barrier
          }
    39. OpenMP – parallel for
          #pragma omp parallel for
          for ( int i = 0; i < NumberOfIterations; i++ ) {
            // Threads scatter here
            // each thread has a private copy of i
            doSomeWork( i );
          }
          // Implicit barrier
        Scheduling the iterations

    40. OpenMP – reduction
          int TotalAmountOfWork = 0;
          #pragma omp parallel for reduction ( + : TotalAmountOfWork )
          for ( int i = 0; i < NumberOfIterations; i++ ) {
            // Threads scatter here
            // each thread has a private copy of i & TotalAmountOfWork
            TotalAmountOfWork += doSomeWork( i );
          }
          // Implicit barrier
          // TotalAmountOfWork was properly accumulated
          // Each thread has a local copy; the barrier does the reduction
          // No need to use critical sections

    41. OpenMP – “atomic” reduction
          int TotalAmountOfWork = 0;
          #pragma omp parallel for
          for ( int i = 0; i < NumberOfIterations; i++ ) {
            // Threads scatter here
            int myWork = doSomeWork( i );
            #pragma omp atomic
            TotalAmountOfWork += myWork;
          }
          // Implicit barrier
          // TotalAmountOfWork was properly accumulated
          // However, the atomic section can cause thread stalls

    42. OpenMP – critical
          int TotalAmountOfWork = 0;
          #pragma omp parallel for reduction ( + : TotalAmountOfWork )
          for ( int i = 0; i < NumberOfIterations; i++ ) {
            // Threads scatter here
            // each thread has a private copy of i
            TotalAmountOfWork += doSomeWork( i );
            #pragma omp critical
            {
              // Executed by one thread at a time, e.g., “Mutex lock”
              criticalOperation();
            }
          }
          // Implicit barrier

    43. OpenMP – single
          int TotalAmountOfWork = 0;
          #pragma omp parallel for reduction ( + : TotalAmountOfWork )
          for ( int i = 0; i < NumberOfIterations; i++ ) {
            // Threads scatter here
            // each thread has a private copy of i
            TotalAmountOfWork += doSomeWork( i );
            #pragma omp single nowait
            {
              // Executed by one thread; use “master” for the main thread
              reportProgress ( TotalAmountOfWork );
            } // !! No implicit barrier because of the “nowait” clause !!
          }
          // Implicit barrier
    44. Threading Building Blocks (TBB)

    45. Introduction to TBB
        Commercial and open source licenses (GPL with runtime exception)
        Cross-platform C++ library, similar to STL
        Usual concurrency classes
        Several different constructs for threading: for, do, reduction, pipeline
        Finer control over scheduling
        Maintains a thread pool to execute tasks
        http://www.threadingbuildingblocks.org/

    46. TBB – parallel for
          #include "tbb/blocked_range.h"
          #include "tbb/parallel_for.h"

          class Worker {
           public:
            Worker ( /* ... */ ) {...}
            void operator() ( const tbb::blocked_range<int>& r ) const {
              for ( int i = r.begin(); i != r.end(); ++i ) {
                doWork ( i );
              }
            }
          };
          ...
          tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
                              Worker ( /* ... */ ), tbb::auto_partitioner() );

    47. TBB – parallel reduction
          #include "tbb/blocked_range.h"
          #include "tbb/parallel_reduce.h"

          class ReducingWorker {
            int mLocalWork;
           public:
            ReducingWorker ( /* ... */ ) {...}
            ReducingWorker ( ReducingWorker& o, tbb::split ) : mLocalWork(0) {}
            void join ( const ReducingWorker& o ) { mLocalWork += o.mLocalWork; }
            void operator() ( const tbb::blocked_range<int>& r ) { ... }
          };
          ...
          ReducingWorker w ( /* ... */ );
          tbb::parallel_reduce ( tbb::blocked_range<int> ( 0, N ),
                                 w, tbb::auto_partitioner() );
          w.getLocalWork();

    48. TBB – parallel reduction

    49. TBB – synchronization
          tbb::spin_mutex MyMutex;
          void doWork ( /* ... */ ) {
            // Enter critical section; exit when the lock goes out of scope
            tbb::spin_mutex::scoped_lock lock ( MyMutex );
            // NB: this is an error – the unnamed temporary unlocks immediately!
            // tbb::spin_mutex::scoped_lock ( MyMutex );
          }
          ...
          #include <tbb/atomic.h>
          tbb::atomic<int> MyCounter;
          ...
          MyCounter = 0;     // Atomic
          int i = MyCounter; // Atomic
          MyCounter++; MyCounter--; ++MyCounter; --MyCounter; // Atomic
          ...
          MyCounter = 0; MyCounter += 2; // Watch out for other threads!
    50. ITK Model

    51. ITK Implementation
        Threads operate across slices
          Only implemented behavior in ITK
        itk::MultiThreader is somewhat flexible
          Requires that you break the ITK model
          Uses thread join, higher overhead
          No thread pool

    52. Comparison
        Threads (C/C++)
          + Fine-grain control
          - Not cross-platform
          - Few constructs
        Language specific (Java)
          + Fine-grain control
          + Cross-platform easy(?)
          + Many constructs
          +/- Language-specific
        OpenMP
          + Simple
          + Adapt existing code
          +/- Industry standard
          +/- Compiler support
          - Coarse-grain control
        TBB
          +/- More complex
          + Fine-grain control
          + Intel (-?)
          + Open Source
          + Some constructs
          - Must re-write code
        ITK
          + Integrated
          + Simple
          - Limited control
          +/- ITK only
    54. Medical Imaging

    55. Image class
          class Image {
           public:
            short* mData;
            int mWidth, mHeight, mDepth;
            int mVoxelsPerSlice;
            int mVoxelsPerVolume;
            short** mSlicePointers; // Pointers to the start of each slice
            short getVoxel ( int x, int y, int z ) {...}
            void setVoxel ( int x, int y, int z, short v ) {...}
          };
    56. Trivial problem – threshold
        Threshold an image
          If intensity > 100, output 1; otherwise output 0
        Present from simple to complex: OpenMP, TBB, ITK, pthread (see extra slides)
    57. Threshold – OpenMP #1
          void doThreshold ( Image* in, Image* out ) {
          #pragma omp parallel for
            for ( int z = 0; z < in->mDepth; z++ ) {
              for ( int y = 0; y < in->mHeight; y++ ) {
                for ( int x = 0; x < in->mWidth; x++ ) {
                  if ( in->getVoxel(x,y,z) > 100 ) {
                    out->setVoxel(x,y,z,1);
                  } else {
                    out->setVoxel(x,y,z,0);
                  }
                }
              }
            }
          }
          // NB: can loop over slices, rows or columns by moving the
          // pragma, but must choose at compile time

    58. Threshold – OpenMP #2
          void doThreshold ( Image* in, Image* out ) {
          #pragma omp parallel for
            for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
              if ( in->mData[s] > 100 ) {
                out->mData[s] = 1;
              } else {
                out->mData[s] = 0;
              }
            }
          }
          // Likely a lot faster than the previous code

    59. Threshold – TBB #1
          class Threshold {
            Image *in, *out;
           public:
            Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
            void operator() ( const tbb::blocked_range<int>& r ) const {
              for ( int x = r.begin(); x != r.end(); ++x ) {
                if ( in->mData[x] > 100 ) {
                  out->mData[x] = 1;
                } else {
                  out->mData[x] = 0;
                }
              }
            }
          };
          ...
          parallel_for ( tbb::blocked_range<int>( 0, in->mVoxelsPerVolume ),
                         Threshold ( in, out ), auto_partitioner() );
          // NB: default “grain size” for blocked_range is 1 pixel
          // tbb::blocked_range<int>(..., in->mVoxelsPerVolume / NumberOfCPUs )

    60. Threshold – TBB #2
          class Threshold {
           public:
            Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
            void operator() ( const tbb::blocked_range<int>& r ) {...}
            void operator() ( const tbb::blocked_range2d<int,int>& r ) {
              for ( int z = 0; z < in->mDepth; z++ ) {
                for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
                  for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
                    if ( in->getVoxel(x,y,z) > 100 ) {
                      out->setVoxel(x,y,z,1);
                    } else {
                      out->setVoxel(x,y,z,0);
                    } } } }
            } };
          ...
          parallel_for ( tbb::blocked_range2d<int,int>( 0, in->mHeight, 32,
                                                        0, in->mWidth, 32 ),
                         Threshold ( in, out ), auto_partitioner() );

    61. Threshold – TBB #3
          class Threshold {
           public:
            Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
            void operator() ( const tbb::blocked_range<int>& r ) {...}
            void operator() ( const tbb::blocked_range2d<int,int>& r ) {...}
            void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
              for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
                for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
                  for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
                    if ( in->getVoxel(x,y,z) > 100 ) {
                      out->setVoxel(x,y,z,1);
                    } else {
                      out->setVoxel(x,y,z,0);
                    } } } }
            } };
          ...
          parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 1,
                                                            0, in->mHeight, 32,
                                                            0, in->mWidth, 32 ),
                         Threshold ( in, out ), auto_partitioner() );

    62. Threshold – ITK solution
          ThreadedGenerateData( const OutputImageRegionType out, int threadId )
          {
            ...
            // Define the iterators
            ImageRegionConstIterator<TIn> inputIt ( inputPtr, out );
            ImageRegionIterator<TOut> outputIt ( outputPtr, out );
            inputIt.GoToBegin();
            outputIt.GoToBegin();
            while ( !inputIt.IsAtEnd() )
            {
              if ( inputIt.Get() > 100 ) {
                outputIt.Set ( 1 );
              } else {
                outputIt.Set ( 0 );
              }
              ++inputIt;
              ++outputIt;
            }
          }
    63. Interesting problem – anisotropic diffusion
        Edge-preserving smoothing method
          Perona and Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (1990) vol. 12 (7) pp. 629–639
        Iterative process
        Demonstrate: OpenMP, TBB
          (ITK has an implementation)
          (pthreads are tedious at the very least)
        Pop quiz – are the following correct?

    64. Anisotropic diffusion – OpenMP
          void doAD ( Image* in, Image* out ) {
          #pragma omp parallel for
            for ( int t = 0; t < TotalTime; t++ ) {
              for ( int z = 0; z < in->mDepth; z++ ) {
                ...
              }
            }
          }

    65. Anisotropic diffusion – OpenMP
          void doAD ( Image* in, Image* out ) {
            short *previousSlice, *slice, *nextSlice;
            for ( int t = 0; t < TotalTime; t++ ) {
          #pragma omp parallel for
              for ( int z = 1; z < in->mDepth-1; z++ ) {
                previousSlice = in->mSlicePointers[z-1];
                slice = in->mSlicePointers[z];
                nextSlice = in->mSlicePointers[z+1];
                for ( int y = 1; y < in->mHeight-1; y++ ) {
                  short* previousRow = slice + (y-1) * in->mWidth;
                  short* row = slice + y * in->mWidth;
                  short* nextRow = slice + (y+1) * in->mWidth;
                  short* aboveRow = previousSlice + y * in->mWidth;
                  short* belowRow = nextSlice + y * in->mWidth;
                  for ( int x = 1; x < in->mWidth-1; x++ ) {
                    dx = 2 * row[x] - row[x-1] - row[x+1];
                    dy = 2 * row[x] - previousRow[x] - nextRow[x];
                    dz = 2 * row[x] - aboveRow[x] - belowRow[x];
                    ...

    66. Anisotropic diffusion – OpenMP
          void doAD ( Image* in, Image* out ) {
            for ( int t = 0; t < TotalTime; t++ ) {
          #pragma omp parallel for
              for ( int z = 1; z < in->mDepth-1; z++ ) {
                short* previousSlice = in->mSlicePointers[z-1];
                short* slice = in->mSlicePointers[z];
                short* nextSlice = in->mSlicePointers[z+1];
                for ( int y = 1; y < in->mHeight-1; y++ ) {
                  short* previousRow = slice + (y-1) * in->mWidth;
                  short* row = slice + y * in->mWidth;
                  short* nextRow = slice + (y+1) * in->mWidth;
                  short* aboveRow = previousSlice + y * in->mWidth;
                  short* belowRow = nextSlice + y * in->mWidth;
                  for ( int x = 1; x < in->mWidth-1; x++ ) {
                    dx = 2 * row[x] - row[x-1] - row[x+1];
                    dy = 2 * row[x] - previousRow[x] - nextRow[x];
                    dz = 2 * row[x] - aboveRow[x] - belowRow[x];
                    ...
    67. Anisotropic diffusion – TBB #1
          class doAD {
           public:
            static ADConstants* sConstants;
            doAD ( Image* in, Image* out ) { ... }
            void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
              if ( sConstants == NULL ) { initConstants(); }
              // process
              ...
            }
          };

    68. Anisotropic diffusion – TBB #2
          class doAD {
           public:
            doAD ( ... ) {...}
            void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
              for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
                for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
                  for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
                    ...
              } };
          ...
          parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,
                                                            0, in->mHeight,
                                                            0, in->mWidth ),
                         doAD ( in, out ), auto_partitioner() );

    69. Anisotropic diffusion – TBB #3
          class doAD {
           public:
            static tbb::atomic<int> sProgress;
            tbb::spin_mutex mMutex;
            doAD ( ... ) {...}
            void reportProgress ( int p ) { ... }
            void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
              for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
                tbb::spin_mutex::scoped_lock lock ( mMutex );
                sProgress++;
                reportProgress ( sProgress );
                for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
                  for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
                    ...
              } };
          ...
          doAD::sProgress = 0;
          parallel_for (...);

    70. Anisotropic diffusion – TBB #4
          class doAD {
           public:
            static tbb::atomic<int> sProgress;
            static tbb::spin_mutex mMutex;
            doAD ( ... ) {...}
            void reportProgress ( int p ) { ... }
            void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
              for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
                tbb::spin_mutex::scoped_lock lock ( mMutex );
                sProgress++;
                reportProgress ( sProgress );
                for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
                  for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
                    ...
              } };
          ...
          doAD::sProgress = 0;
          parallel_for (...);
    71. Anisotropic diffusion – OpenMP (progress)
          void doAD ( Image* in, Image* out ) {
            int progress = 0;
            for ( int t = 0; t < TotalTime; t++ ) {
          #pragma omp parallel for
              for ( int s = 0; s < in->mDepth; s++ ) {
                #pragma omp atomic
                progress++;
                #pragma omp single nowait
                reportProgress ( progress );
                ...
              }
            }
          }
    72. Real-life problem
        Compute Frangi’s vesselness measure
          Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946–956
        Memory-constrained solution
          ITK implementation requires 1.2G for a 100M volume
          Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007)
        Possible solutions using OpenMP, TBB

    73. Vesselness

    74. ITK Implementation – computing the Hessian
        6 volumes computed in serial
        Individual filters are threaded
        Good CPU usage
        High memory requirements

    75. Design considerations
        Break the problem into blocks
        Compute Hessian, eigenvalues, and vesselness per block
        Reduces memory requirements
        Incurs overhead, boundary conditions

    76. Design considerations
        Keep the CPUs full

    77. Design considerations – boundary condition

    78. Trade-offs
    79. Algorithm sketch – Serial
          int BlockSize = 32;
          for ( int z = 0; z < image->mDepth; z += BlockSize ) {
            for ( int y = 0; y < image->mHeight; y += BlockSize ) {
              for ( int x = 0; x < image->mWidth; x += BlockSize ) {
                processBlock ( in, out, x, y, z, BlockSize );
              }
            }
          }
    80. Algorithm sketch – OpenMP
          int BlockSize = 32;
          #pragma omp parallel for
          for ( int z = 0; z < image->mDepth; z += BlockSize ) {
            for ( int y = 0; y < image->mHeight; y += BlockSize ) {
              for ( int x = 0; x < image->mWidth; x += BlockSize ) {
                processBlock ( in, out, x, y, z, BlockSize );
              }
            }
          }
        Each thread is on a different slice
        May cause cache contention
        Similar problems for the “y” direction

    81. Algorithm sketch – OpenMP
          int BlockSize = 32;
          for ( int z = 0; z < image->mDepth; z += BlockSize ) {
            for ( int y = 0; y < image->mHeight; y += BlockSize ) {
          #pragma omp parallel for
              for ( int x = 0; x < image->mWidth; x += BlockSize ) {
                processBlock ( in, out, x, y, z, BlockSize );
              }
            }
          }
        All threads on the same rows
        May not utilize all CPUs if the ratio of Width to BlockSize < # CPUs
        Better cache utilization

    82. Algorithm sketch – TBB
          class Vesselness {
           public:
            void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
              // Process the block; could use ITK here
              processBlock (
                r.cols().begin(), r.rows().begin(), r.pages().begin(),
                r.cols().size(), r.rows().size(), r.pages().size() );
            }
          };
          ...
          parallel_for ( tbb::blocked_range3d<int,int,int>(
                           0, in->mDepth, 32,
                           0, in->mHeight, 32,
                           0, in->mWidth, 32 ),
                         Vesselness ( in, out ), auto_partitioner() );
        Individual blocks, full CPUs
        May not have the best cache performance
    83. Next steps
        Go try parallel development
        Try threads to gain understanding and insight
        Next OpenMP, adapting existing code
        TBB: more constructs, different approaches
        Experiment with new languages: Erlang, Scala, Reia, Chapel, X10, Fortress...
        Check out some of the resources provided
        Have fun! It’s a brave new world out there...

    84. Resources
        TBB (http://www.threadingbuildingblocks.org/)
        OpenMP (http://openmp.org/wp/)
        Books/Articles
          Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/)
          Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/)
          ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf)
          The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)
        Tutorials
          Parallel Programming (https://computing.llnl.gov/tutorials/parallel_comp/)
          pthreads (https://computing.llnl.gov/tutorials/pthreads/)
          OpenMP (https://computing.llnl.gov/tutorials/openMP/)
        Other
          LLNL (https://computing.llnl.gov/)
          Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language)
          GCC-OpenMP (http://gcc.gnu.org/projects/gomp/)
          Intel Compiler (http://software.intel.com/en-us/intel-compilers/)

    85. Resources
        Languages
          Erlang (http://www.erlang.org/)
          Scala (http://www.scala-lang.org/)
          Chapel (http://chapel.cs.washington.edu/)
          X10 (http://x10-lang.org/)
          Unified Parallel C (http://upc.gwu.edu/)
          Titanium (http://titanium.cs.berkeley.edu/)
          Co-Array Fortran (http://www.co-array.org/)
          ZPL (http://www.cs.washington.edu/research/zpl/home/index.html)
          High Performance Fortran (http://hpff.rice.edu/)
          Fortress (http://projectfortress.sun.com/Projects/Community/)
          Others (http://www.google.com/search?q=parallel+programming+language)

    86. Medical image processing strategies for multi-core CPUs
        Daniel Blezek, Mayo Clinic
        blezek.daniel@mayo.edu
    87. Thread construction – pthread example
          #include <pthread.h>

          int
          pthread_create(pthread_t *restrict thread,
                         const pthread_attr_t *restrict attr,
                         void *(*start_routine)(void *),
                         void *restrict arg);

          void
          pthread_exit(void *value_ptr);

          int
          pthread_join(pthread_t thread, void **value_ptr);

    88. Mutex – pthread example
          #include <pthread.h>

          pthread_mutex_t myMutex;
          ...
          pthread_mutex_init ( &myMutex, NULL );
          ...
          pthread_mutex_lock ( &myMutex );
          // Critical section, only one thread at a time
          ...
          pthread_mutex_unlock ( &myMutex );
          ...
          if ( pthread_mutex_trylock ( &myMutex ) != EBUSY ) {
            // We did get the lock, so we are in the critical section
            ...
            pthread_mutex_unlock ( &myMutex );
          }

    89. Mutex – Java example
          import java.lang.*;

          class Foo {
            public synchronized int doWork () {
              // only one thread can execute doWork
            }
            Object resource;
            public int otherWork () {
              synchronized ( resource ) {
                // critical section, resource is the mutex
                ...
              }
            }
          }

    90. Threshold – pthread
          struct Work { Image* in; Image* out; int start; int end; };
          Work workArray[THREADCOUNT];
          pthread_t thread[THREADCOUNT];

          void* doThreshold ( void* inWork ) {
            Work* work = (Work*) inWork;
            for ( int s = work->start; s < work->end; s++ ) {...}
          }
          ...
          pthread_attr_t attributes;
          pthread_attr_init ( &attributes );
          pthread_attr_setdetachstate ( &attributes, PTHREAD_CREATE_JOINABLE );
          for ( int t = 0; t < THREADCOUNT; t++ ) {
            initializeWork ( in, out, t, workArray[t] );
            pthread_create ( &thread[t], &attributes,
                             doThreshold, (void*) &workArray[t] );
          }
          for ( int t = 0; t < THREADCOUNT; t++ ) {
            pthread_join ( thread[t], NULL );
          }

    91. Insight Toolkit

    92. Semaphore
        Allow N threads access
        Protects limited resources
        Binary semaphore: N = 1, equivalent to a Mutex

    93. ITK Implementation
        Threads operate across slices
          Only implemented behavior in ITK
        itk::MultiThreader is somewhat flexible
          Requires that you break the ITK model
          Uses thread join, higher overhead
          No thread pool

    94. ITK – itk::MultiThreader
          #include <itkMultiThreader.h>

          // Win32
          DWORD doWork ( LPVOID lpThreadParameter );
          // pthread – Linux, Mac, Unix
          void* doWork ( void* inWork );

          itk::MultiThreader::Pointer threader = itk::MultiThreader::New();
          threader->SetNumberOfThreads ( NumberOfThreads );
          for ( int i = 0; i < NumberOfThreads; i++ ) {
            threader->SetMultipleMethod ( i, doWork, (void*) work[i] );
          }
          // Explicit barrier, waits for thread join
          threader->MultipleMethodExecute();

    95. Insight Toolkit
          #include <itkImageToImageFilter.h>

          template <class TIn, class TOut>
          class Worker : public ImageToImageFilter<TIn, TOut> {
            ...
            void BeforeThreadedGenerateData() {
              // Master thread only
              ...
            }
            void ThreadedGenerateData( const OutputImageRegionType& r, int tid ) {
              // Generate output data for r
              ...
            }
            void AfterThreadedGenerateData() {
              // Master thread only
              ...
            }
          };
          // Output split on the last dimension
          // i.e. slices for 3D volumes

    96. Anisotropic diffusion – OpenMP
          void doAD ( Image* in, Image* out ) {
            for ( int t = 0; t < TotalTime; t++ ) {
          #pragma omp parallel for
              for ( int slice = 0; slice < in->mDepth; slice++ ) {
                ...
              }
            }
          }
