  • If I had asked this question 5 years ago, almost no one would have raised their hand.
  • Driving is inherently a parallel task: we coordinate at stop signs and stop lights and obey the rules of the road, but we can still get deadlocked (gridlock).

Transcript

  • 1. Medical image processing strategies for multi-core CPUs
    Daniel Blezek, Mayo Clinic
    blezek.daniel@mayo.edu
  • 2. Poll
    Does your primary computer have more than one core...?
    Have you ever written parallel code?
  • 3. It’s a parallel world...
    SMP formerly was the domain of researchers
    Thanks to Intel, now it’s everywhere!
    ... but most of us think in serial ...
    • Hardware has far outstripped software
    • Developers are not trained
    • Development of parallel software is difficult
    • Outside the box
    • Erlang
    • Scala
    • ...
  • 10. Parallel Computing – according to Google
    “parallel computing” 1.4M hits on Google
    “multithreading” 10M hits
    “multicore” 2.4M hits
    “parallel programming” 1.1M hits
    Why is it so hard?
    the world is parallel
    we all think in parallel
    yet we are taught to program in serial
  • 11. Degrees of parallelism (my take)
    Serial – SISD single thread of execution
    Data parallel – SIMD (fine grained parallelism)
    Embarrassingly parallel – larger scale SIMD
    CT or MR reconstruction
    Each operation is independent, e.g. iFFT of slices
    Worker thread – e.g. virus scanning software
    Coarse grained parallelism – SMP or MIMD
    Focus of this presentation, more in GPU talk
    Concurrency, OpenMP, TBB, pthreads/Winthreads
    Large scale – MPI on cluster, tight coupling
    Large scale – Grid computing, loose coupling
  • 12. Pragmatic approach
    C/C++ and Fortran are the kings of performance
    (I’ve never written a single line of Fortran, so don’t ask)
    “Bolted on” parallel concepts
    Zero language support
    Huge existing codebase
  • 13. Pragmatic approach
    Briefly touch on SIMD
    Introduce SMP concepts
    Threads, concurrency
    Development models
    pthreads/WinThreads
    OpenMP
    TBB
    ITK
    Medical Image Processing
    Example problems
    Common errors
    Next steps
  • 14. SIMD
  • 15. SIMD – basic principles
    http://en.wikipedia.org/wiki/SIMD
  • 16. Data structures for SIMD
    Array of Structures
    struct Vec {
      float x, y, z;
    };
    Vec* points = new Vec[sz];
    [Figure: X, Y, Z values are packed, operated on together (*), then unpacked]
  • 17. Data structures for SIMD
    Structure of Arrays
    struct Vec {
      float* x;
      float* y;
      float* z;
      Vec ( int sz ) {
        x = new float[sz];
        y = new float[sz];
        z = new float[sz];
      }
    };
    Structure of Arrays
    struct Vec {
      Vector4f* v;
      Vec ( int sz ) {
        // must be word aligned
        v = new Vector4f[sz];
      }
    };
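The two layouts can be compared side by side in standard C++. This is a minimal sketch, not code from the talk: `PointAoS`, `PointsSoA`, and `scaleX` are illustrative names, and whether the second loop is actually vectorized depends on the compiler and flags.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures: x, y, z of one point sit next to each other.
struct PointAoS {
    float x, y, z;
};

// Structure of Arrays: all x values are contiguous, then all y, then all z.
struct PointsSoA {
    std::vector<float> x, y, z;
    explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Scale every x coordinate in the AoS layout. Consecutive x values are
// sizeof(PointAoS) bytes apart, so packed loads also pull in y and z.
void scaleX(std::vector<PointAoS>& pts, float s) {
    for (std::size_t i = 0; i < pts.size(); ++i) {
        pts[i].x *= s;
    }
}

// Same operation in the SoA layout: unit stride over one float array,
// which is the access pattern SIMD loads and auto-vectorizers prefer.
void scaleX(PointsSoA& pts, float s) {
    for (std::size_t i = 0; i < pts.x.size(); ++i) {
        pts.x[i] *= s;
    }
}
```

Both overloads compute the same result; only the memory access pattern differs.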
  • 18. SIMD pitfalls
    Structure alignment
    Usually needs to be aligned on word boundary
    Structure considerations
    May need to refactor existing code/structures
    Generally not cross-platform
    MMX, 3DNow!, SSE, SSE2, SSE4, AltiVec, AVX, etc.
    Performance gains are modest
    2x – 4x common
    Limited instructions
    Add, multiply, divide, round
    Not suitable for branching logic
    Autovectorizing compilers for simple loops
    -ftree-vectorize (GCC), -fast, -O2 or -O3 (Intel Compiler)
  • 19. Threads
  • 20. Threads – they’re everywhere
  • 21. SMP concepts
    Useful to think in terms of “cores”
    2 dual-core CPUs = 4 “cores”
    Cores share main memory, may share cache
    Threads in same process share memory
    Generally, one executing thread per core
    Other threads sleeping
  • 22. Cores – they’re everywhere
    How many cores does your laptop have?
    Mine has 50(!)
    2 Intel CPU cores (Core 2 Duo)
    32 nVidia cores (9600M GT)
    16 nVidia cores (9400M)
  • 23. Parallel concepts for SMP
    Process
    Started by the OS
    Single thread executes “main”
    No direct access to memory of other processes
    Threads
    Stream of execution under a process
    Access to memory in containing process
    Private memory
    Lifetime may be less than main thread
    Concurrency
    Coordination between threads
    High level (mutex, locks, barriers)
    Low level (atomic operations)
  • 24. Processes & Threads
    [Figure: processes and the threads running inside them]
  • 25. Thread construction – pthread example
    #include <pthread.h>
    // Thread work function, must return pointer to void
    void *doWork(void *work) {
      // Do work
      return work; // equivalent to pthread_exit ( work );
    }
    ...
    pthread_t child;
    ...
    rc = pthread_create(&child, &attr, doWork, (void *)work);
    ...
    rc = pthread_join ( child, &threadwork );
    ...
  • 26. Thread construction – Win32 example
    #include <windows.h>
    DWORD WINAPI doWork( LPVOID work ) { ... }
    ...
    PMYDATA work;
    DWORD childID;
    HANDLE child;
    child = CreateThread(
      NULL, // default security attributes
      0, // use default stack size
      doWork, // thread function name
      work, // argument to thread function
      0, // use default creation flags
      &childID); // returns the thread identifier
    // Wait for the single child thread; for an array of NThreads handles,
    // use WaitForMultipleObjects(NThreads, handles, TRUE, INFINITE)
    WaitForSingleObject(child, INFINITE);
  • 27. Thread construction – Java example
    import java.lang.Thread;
    class Worker implements Runnable {
      public Worker ( Work work ) {}
      public void run() {} // Do work here
    }
    ...
    Worker worker = new Worker ( someWork );
    new Thread ( worker ).start();
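For comparison with the pthread, Win32, and Java examples, the same create-then-join pattern can be written with C++11 `std::thread`. This is a sketch with a swapped-in API, not something the slides use; the `Work` struct, `doWork`, and `runAll` are hypothetical names chosen for illustration.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Hypothetical work item: each thread squares one input value.
struct Work {
    int input;
    int result;
};

// Thread work function; writes only to its own Work item, so no locking.
void doWork(Work& work) {
    work.result = work.input * work.input;
}

// Start one thread per work item, then join them all.
void runAll(std::vector<Work>& items) {
    std::vector<std::thread> children;
    for (Work& w : items) {
        children.emplace_back(doWork, std::ref(w));  // like pthread_create
    }
    for (std::thread& t : children) {
        t.join();  // like pthread_join: wait for this thread to finish
    }
}
```

Because every thread touches a different `Work` element, the results are deterministic without any synchronization beyond the joins.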
  • 28. Race Conditions
    [Figure: serial vs. parallel access to a shared resource; the parallel case causes a problem]
  • 29. Mutex
    Mutex – Mutual exclusion lock
    Protects a section of code
    Only one thread has a lock on the object
    Threads may
    wait for the mutex
    return a status if the mutex is locked
    Semaphore
    N threads
    Critical Section
    One thread executes code
    Protects global resources
    Maintain consistent state
  • 30. Race Conditions
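    ...
    N = 0;
    ...
    // Start some threads
    ...
    void* doWork() {
      N++; // get, incr, store: not atomic, so updates can be lost
    }
    Solution w/Mutex
    Mutex mutex;
    void* doWork() {
      mutex.lock();
      N++; // only one thread at a time executes this line
      mutex.release();
    }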
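The mutex fix for the `N++` race can be tried out with the C++11 standard library. This is a minimal sketch under the assumption that `std::mutex` stands in for the slide's generic `Mutex` class; `countWithMutex` is a hypothetical helper.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Increment a shared counter from several threads, protected by a mutex.
int countWithMutex(int numThreads, int incrementsPerThread) {
    int N = 0;         // shared state
    std::mutex mutex;  // protects N

    auto doWork = [&]() {
        for (int i = 0; i < incrementsPerThread; ++i) {
            // lock_guard locks here and unlocks when it goes out of scope
            std::lock_guard<std::mutex> lock(mutex);
            N++;  // get, incr, store happen while holding the lock
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back(doWork);
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return N;
}
```

Removing the `lock_guard` line turns this back into the race from the slide, and the returned count may then fall short of `numThreads * incrementsPerThread`.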
  • 31. Atomic operations
    Locks are not perfect
    Cause blocking
    Relatively heavy-weight
    Atomic operations
    Simple operations
    Hardware support
    Can implement w/Mutex
    Conditions
    Invisibility – no other thread knows about the change
    Atomicity – if operation fails, return to original state
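As a lighter-weight alternative to a lock, the counter can be made atomic. The sketch below uses C++11 `std::atomic` (an assumption; the slides do not name a specific API here). `countWithAtomic` and `atomicMax` are hypothetical helpers; the second shows the compare-and-swap retry pattern, where a failed attempt leaves the shared value untouched.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Count with an atomic counter instead of a mutex.
int countWithAtomic(int numThreads, int incrementsPerThread) {
    std::atomic<int> N(0);
    auto doWork = [&N, incrementsPerThread]() {
        for (int i = 0; i < incrementsPerThread; ++i) {
            N.fetch_add(1);  // one indivisible read-modify-write
        }
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back(doWork);
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return N.load();
}

// Store 'value' into 'target' only if it is larger than the current value.
void atomicMax(std::atomic<int>& target, int value) {
    int seen = target.load();
    // On failure, compare_exchange_weak refreshes 'seen' with the current
    // value and leaves 'target' unchanged, so the loop simply retries.
    while (value > seen && !target.compare_exchange_weak(seen, value)) {
    }
}
```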
  • 32. Deadlock
    [Figure: threads and Mutex A / Mutex B locked in a deadlock]
  • 33. Thread synchronization – barrier
    Initialized with the number of threads expected
    Threads signal when they are ready
    Wait until all expected threads are there
    A stalled or dead thread can stall all the threads
  • 34. Thread synchronization – Condition variables
    Workers atomically release mutex and wait
    Master atomically releases mutex and signals
    Workers wake up and acquire mutex
    [Figure: timeline of threads holding Mutex A, working, and waiting on the condition]
  • 35. Thread pool & Futures
    Maintains a “pool” of Worker threads
    Work queued until thread available
    Optionally notify through a “Future”
    Future can query status, holds return value
    Thread returns to pool, no startup overhead
    Core concept for OpenMP and TBB
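The future half of this idea can be sketched with C++11 `std::async`. Note the assumption: `std::async` is not guaranteed to use a thread pool, so this illustrates queuing work and collecting results through futures, not pool reuse. `sumRange` and `parallelSum` are hypothetical names.

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Hypothetical task: sum the integers in [begin, end).
long sumRange(int begin, int end) {
    long total = 0;
    for (int i = begin; i < end; ++i) {
        total += i;
    }
    return total;
}

// Split [0, n) into numTasks chunks, run each asynchronously,
// and combine the values held by the futures.
long parallelSum(int n, int numTasks) {
    std::vector<std::future<long>> futures;
    const int chunk = (n + numTasks - 1) / numTasks;
    for (int t = 0; t < numTasks; ++t) {
        const int begin = t * chunk;
        const int end = std::min(n, begin + chunk);
        futures.push_back(
            std::async(std::launch::async, sumRange, begin, end));
    }
    long total = 0;
    for (std::future<long>& f : futures) {
        total += f.get();  // blocks until that task's return value is ready
    }
    return total;
}
```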
  • 36. OpenMP
  • 37. Introduction to OpenMP
    Scatter / gather paradigm
    Maintains a thread pool
    Requires compiler support
    Visual C++, GCC 4.2 and later, Intel Compiler
    Easy to adapt existing serial code, easy to debug
    Simple paradigm
  • 38. OpenMP – simple parallel sections
    #pragma omp parallel sections num_threads ( 5 )
    {
      // 5 Threads scatter here
      #pragma omp section
      { /* Do task 1 */ }
      #pragma omp section
      { /* Do task 2 */ }
      ...
      #pragma omp section
      { /* Do task N */ }
      // Implicit barrier
    }
  • 39. OpenMP – parallel for
    #pragma omp parallel for
    for ( int i = 0; i < NumberOfIterations; i++ ) {
    // Threads scatter here
    // each thread has a private copy of i
    doSomeWork( i );
    }
    // Implicit barrier
    Scheduling the iterations
  • 40. OpenMP – reduction
    int TotalAmountOfWork = 0;
    #pragma omp parallel for reduction ( + : TotalAmountOfWork )
    for ( int i = 0; i < NumberOfIterations; i++ ) {
    // Threads scatter here
    // each thread has a private copy of i & TotalAmountOfWork
    TotalAmountOfWork += doSomeWork( i );
    }
    // Implicit barrier
    // TotalAmountOfWork was properly accumulated
    // Each thread has local copy, barrier does reduction
    // No need to use critical sections
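A complete, compilable version of the reduction pattern might look like the following. It is a sketch: `doSomeWork` is given a made-up body so the result can be checked, and `totalWork` is a hypothetical wrapper. Built without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```cpp
// Hypothetical per-iteration work: returns the amount of work done.
int doSomeWork(int i) {
    return i % 7;
}

// Parallel sum using an OpenMP reduction.
int totalWork(int numberOfIterations) {
    int total = 0;
    // Each thread accumulates into a private copy of 'total';
    // the copies are combined with + when the loop ends.
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < numberOfIterations; i++) {
        total += doSomeWork(i);
    }
    return total;
}

// Serial reference for comparison.
int totalWorkSerial(int numberOfIterations) {
    int total = 0;
    for (int i = 0; i < numberOfIterations; i++) {
        total += doSomeWork(i);
    }
    return total;
}
```

With GCC this would typically be built with `-fopenmp`.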
  • 41. OpenMP – “atomic” reduction
    int TotalAmountOfWork = 0;
    #pragma omp parallel for
    for ( int i = 0; i < NumberOfIterations; i++ ) {
    // Threads scatter here
    int myWork = doSomeWork( i );
    #pragma omp atomic
    TotalAmountOfWork += myWork;
    }
    // Implicit barrier
    // TotalAmountOfWork was properly accumulated
    // However, the atomic section can cause thread stalls
  • 42. OpenMP – critical
    int TotalAmountOfWork = 0;
    #pragma omp parallel for reduction ( + : TotalAmountOfWork )
    for ( int i = 0; i < NumberOfIterations; i++ ) {
    // Threads scatter here
    // each thread has a private copy of i
    TotalAmountOfWork += doSomeWork( i );
    #pragma omp critical
    {
    // Execute by one thread at a time, e.g., “Mutex lock”
    criticalOperation();
    }
    }
    // Implicit barrier
  • 43. OpenMP – single
    int TotalAmountOfWork = 0;
    #pragma omp parallel
    {
      #pragma omp for reduction ( + : TotalAmountOfWork )
      for ( int i = 0; i < NumberOfIterations; i++ ) {
        // Threads scatter here
        // each thread has a private copy of i
        TotalAmountOfWork += doSomeWork( i );
      }
      // Implicit barrier at the end of the for, reduction is complete
      // NB: “single” may not be nested directly inside the for loop
      #pragma omp single nowait
      {
        // Executed by one thread, use “master” for the main thread
        reportProgress ( TotalAmountOfWork );
      } // !! No implicit barrier because of “nowait” clause !!
    }
    // Implicit barrier at the end of the parallel region
  • 44. Threading Building Blocks (TBB)
  • 45. Introduction to TBB
    Commercial and Open Source Licenses
    GPL with runtime exception
    Cross-platform C++ library
    Similar to STL
    Usual concurrency classes
    Several different constructs for threading
    for, do, reduction, pipeline
    Finer control over scheduling
    Maintains a thread pool to execute tasks
    http://www.threadingbuildingblocks.org/
  • 46. TBB – parallel for
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    class Worker {
    public:
    Worker ( /* ... */ ) {...};
    void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int i = r.begin(); i != r.end(); ++i ) {
    doWork ( i );
    }
    }
    };
    ...
    tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
    Worker ( /* ... */ ), tbb::auto_partitioner() );
  • 47. TBB – parallel reduction
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_reduce.h"
    class ReducingWorker {
      int mLocalWork;
    public:
      ReducingWorker ( /* ... */ ) {...}
      ReducingWorker ( ReducingWorker& o, tbb::split ) : mLocalWork(0) {}
      void join ( const ReducingWorker& o ) { mLocalWork += o.mLocalWork; }
      void operator() ( const tbb::blocked_range<int>& r ) { ... }
      int getLocalWork() const { return mLocalWork; }
    };
    ...
    ReducingWorker w;
    tbb::parallel_reduce ( tbb::blocked_range<int> ( 0, N ),
                           w, tbb::auto_partitioner() );
    w.getLocalWork();
  • 48. TBB – parallel reduction
  • 49. TBB – synchronization
    tbb::spin_mutex MyMutex;
    void doWork ( /* ... */ ) {
    // Enter critical section, exit when lock goes out of scope
    tbb::spin_mutex::scoped_lock lock ( MyMutex );
    // NB: This is an error!!!
    // tbb::spin_mutex::scoped_lock( MyMutex );
    }
    ...
    #include <tbb/atomic.h>
    tbb::atomic<int> MyCounter;
    ...
    MyCounter = 0; // Atomic
    int i = MyCounter; // Atomic
    MyCounter++; MyCounter--; ++MyCounter; --MyCounter; // Atomic
    ...
    MyCounter = 0; MyCounter += 2; // Watch out for other threads!
  • 50. ITK Model
  • 51. ITK Implementation
    Threads operate across slices
    Only implemented behavior in ITK
    itk::MultiThreader is somewhat flexible
    Requires that you break the ITK model
    Uses Thread Join, higher overhead
    No thread pool
  • 52. Comparison
    Language specific (Java)
    + Fine-grain control
    + Cross-platform easy(?)
    + Many constructs
    +/- Language-specific
    Threads (C/C++)
    + Fine-grain control
    - Not cross-platform
    - Few constructs
    ITK
    + Integrated
    + Simple
    - Limited control
    +/- ITK only
    TBB
    +/- More complex
    + Fine-grain control
    + Intel (-?)
    + Open Source
    + Some constructs
    - Must re-write code
    OpenMP
    + Simple
    + Adapt existing code
    +/- Industry standard
    +/- Compiler support
    - Coarse-grain control
  • 54. Medical Imaging
  • 55. Image class
    class Image {
    public:
      short* mData;
      int mWidth, mHeight, mDepth;
      int mVoxelsPerSlice;
      int mVoxelsPerVolume;
      short** mSlicePointers; // Pointers to the start of each slice
      short getVoxel ( int x, int y, int z ) {...}
      void setVoxel ( int x, int y, int z, short v ) {...}
    };
  • 56. Trivial problem – threshold
    Threshold an image
    If intensity > 100, output 1
    otherwise output 0
    Present from simple to complex
    OpenMP
    TBB
    ITK
    pthread (see extra slides)
  • 57. Threshold – OpenMP #1
    void doThreshold ( Image* in, Image* out ) {
    #pragma omp parallel for
    for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
    for ( int x = 0; x < in->mWidth; x++ ) {
    if ( in->getVoxel(x,y,z) > 100 ) {
    out->setVoxel(x,y,z,1);
    } else {
    out->setVoxel(x,y,z,0);
    }
    }
    }
    }
    }
    // NB: can loop over slices, rows or columns by moving
    // pragma, but must choose at compile time
  • 58. Threshold – OpenMP #2
    void doThreshold ( Image* in, Image* out ) {
    #pragma omp parallel for
    for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
    out->mData[s] = 1;
    } else {
    out->mData[s] = 0;
    }
    }
    }
    // Likely a lot faster than previous code
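The flat-buffer threshold can be written as a self-contained function. This is a sketch that replaces the slide's `Image` class with a plain `std::vector<short>` (an assumption made to keep the example short); `threshold` is a hypothetical name.

```cpp
#include <vector>

// Threshold a flat voxel buffer: output 1 where intensity > level, else 0.
std::vector<short> threshold(const std::vector<short>& in, short level) {
    std::vector<short> out(in.size());
    // Signed loop index, which older OpenMP versions require.
    const long n = static_cast<long>(in.size());
    // Every voxel is independent, so iterations can run in any order
    // and each thread writes a disjoint part of 'out'.
    #pragma omp parallel for
    for (long s = 0; s < n; s++) {
        out[s] = (in[s] > level) ? 1 : 0;
    }
    return out;
}
```

Walking the buffer linearly avoids the per-voxel index arithmetic hidden in `getVoxel` and `setVoxel`, which is one plausible reason for the speedup the slide mentions.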
  • 59. Threshold – TBB #1
    class Threshold {
      Image* in;
      Image* out;
    public:
      Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
      void operator() ( const tbb::blocked_range<int>& r ) {
        for ( int x = r.begin(); x != r.end(); ++x ) {
          if ( in->mData[x] > 100 ) {
            out->mData[x] = 1;
          } else {
            out->mData[x] = 0;
          }
        }
      }
    };
    ...
    parallel_for ( tbb::blocked_range<int>( 0, in->mVoxelsPerVolume ),
                   Threshold ( in, out ), auto_partitioner() );
    // NB: default “grain size” for blocked_range is 1 pixel
    // tbb::blocked_range<int>(..., in->mVoxelsPerVolume / NumberOfCPUs )
  • 60. Threshold – TBB #2
    class Threshold {
    public:
      Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
      void operator() ( const tbb::blocked_range<int>& r ) {...}
      void operator() ( const tbb::blocked_range2d<int,int>& r ) {
        // Each task handles one rows x cols tile through every slice
        for ( int z = 0; z < in->mDepth; z++ ) {
          for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
            for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
              if ( in->getVoxel(x,y,z) > 100 ) {
                out->setVoxel(x,y,z,1);
              } else {
                out->setVoxel(x,y,z,0);
              } } } }
      } };
    ...
    parallel_for ( tbb::blocked_range2d<int,int>( 0, in->mHeight, 32,
                                                  0, in->mWidth, 32 ),
                   Threshold ( in, out ), auto_partitioner() );
  • 61. Threshold – TBB #3
    class Threshold {
    public:
      Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
      void operator() ( const tbb::blocked_range<int>& r ) {...}
      void operator() ( const tbb::blocked_range2d<int,int>& r ) {...}
      void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
        for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
          for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
            for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
              if ( in->getVoxel(x,y,z) > 100 ) {
                out->setVoxel(x,y,z,1);
              } else {
                out->setVoxel(x,y,z,0);
              } } } }
      } };
    ...
    parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 1,
                                                      0, in->mHeight, 32,
                                                      0, in->mWidth, 32 ),
                   Threshold ( in, out ), auto_partitioner() );
  • 62. Threshold – ITK solution
    void ThreadedGenerateData( const OutputImageRegionType& out, int threadId )
    {
      ...
      // Define the iterators
      ImageRegionConstIterator<TIn> inputIt(inputPtr, out);
      ImageRegionIterator<TOut> outputIt(outputPtr, out);
      inputIt.GoToBegin();
      outputIt.GoToBegin();
      while( !inputIt.IsAtEnd() )
      {
        if ( inputIt.Get() > 100 ) {
          outputIt.Set ( 1 );
        } else {
          outputIt.Set ( 0 );
        }
        ++inputIt;
        ++outputIt;
      }
    }
  • 63. Interesting problem – anisotropic diffusion
    Edge preserving smoothing method
    Perona and Malik. Scale-space and edge detection using anisotropic diffusion. Pattern Analysis and Machine Intelligence, IEEE Transactions on (1990) vol. 12 (7) pp. 629 – 639
    Iterative process
    Demonstrate
    OpenMP
    TBB
    (ITK has an implementation)
    (pthreads are tedious at the very least)
    Pop quiz – are the following correct?
  • 64. Anisotropic diffusion – OpenMP
    void doAD ( Image* in, Image* out ) {
    #pragma omp parallel for
    for ( int t = 0; t < TotalTime; t++ ) {
    for ( int z = 0; z < in->mDepth; z++ ) {
    ...
    }
    }
    }
  • 65. Anisotropic diffusion – OpenMP
    void doAD ( Image* in, Image* out ) {
      short *previousSlice, *slice, *nextSlice;
      for ( int t = 0; t < TotalTime; t++ ) {
        #pragma omp parallel for
        for ( int z = 1; z < in->mDepth-1; z++ ) {
          previousSlice = in->mSlicePointers[z-1];
          slice = in->mSlicePointers[z];
          nextSlice = in->mSlicePointers[z+1];
          for ( int y = 1; y < in->mHeight-1; y++ ) {
            short* previousRow = slice + (y-1) * in->mWidth;
            short* row = slice + y * in->mWidth;
            short* nextRow = slice + (y+1) * in->mWidth;
            short* aboveRow = previousSlice + y * in->mWidth;
            short* belowRow = nextSlice + y * in->mWidth;
            for ( int x = 1; x < in->mWidth-1; x++ ) {
              dx = 2 * row[x] - row[x-1] - row[x+1];
              dy = 2 * row[x] - previousRow[x] - nextRow[x];
              dz = 2 * row[x] - aboveRow[x] - belowRow[x];
              ...
  • 66. Anisotropic diffusion – OpenMP
    void doAD ( Image* in, Image* out ) {
      for ( int t = 0; t < TotalTime; t++ ) {
        #pragma omp parallel for
        for ( int z = 1; z < in->mDepth-1; z++ ) {
          short* previousSlice = in->mSlicePointers[z-1];
          short* slice = in->mSlicePointers[z];
          short* nextSlice = in->mSlicePointers[z+1];
          for ( int y = 1; y < in->mHeight-1; y++ ) {
            short* previousRow = slice + (y-1) * in->mWidth;
            short* row = slice + y * in->mWidth;
            short* nextRow = slice + (y+1) * in->mWidth;
            short* aboveRow = previousSlice + y * in->mWidth;
            short* belowRow = nextSlice + y * in->mWidth;
            for ( int x = 1; x < in->mWidth-1; x++ ) {
              dx = 2 * row[x] - row[x-1] - row[x+1];
              dy = 2 * row[x] - previousRow[x] - nextRow[x];
              dz = 2 * row[x] - aboveRow[x] - belowRow[x];
              ...
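The loop structure above (serial time steps, slices processed in parallel, all per-slice state declared inside the parallel loop) can be exercised with a simplified stand-in. The sketch below performs plain linear diffusion with a 7-point stencil, not the Perona-Malik edge-stopping update, and uses a hypothetical `Volume` struct with float voxels and two buffers so that a step never reads voxels it has already overwritten.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal volume with a flat float buffer.
struct Volume {
    int width, height, depth;
    std::vector<float> data;
    Volume(int w, int h, int d)
        : width(w), height(h), depth(d),
          data(static_cast<std::size_t>(w) * h * d, 0.0f) {}
    std::size_t index(int x, int y, int z) const {
        return (static_cast<std::size_t>(z) * height + y) * width + x;
    }
    float get(int x, int y, int z) const { return data[index(x, y, z)]; }
    void set(int x, int y, int z, float v) { data[index(x, y, z)] = v; }
};

// One explicit time step on interior voxels: reads 'in', writes 'out'.
void diffuseStep(const Volume& in, Volume& out, float dt) {
    #pragma omp parallel for
    for (int z = 1; z < in.depth - 1; z++) {
        // Everything declared in here is private to the executing thread.
        for (int y = 1; y < in.height - 1; y++) {
            for (int x = 1; x < in.width - 1; x++) {
                const float c = in.get(x, y, z);
                const float laplacian =
                    in.get(x - 1, y, z) + in.get(x + 1, y, z) +
                    in.get(x, y - 1, z) + in.get(x, y + 1, z) +
                    in.get(x, y, z - 1) + in.get(x, y, z + 1) - 6.0f * c;
                out.set(x, y, z, c + dt * laplacian);
            }
        }
    }
}

// Time steps depend on each other, so the outer loop stays serial.
void diffuse(Volume& image, int steps, float dt) {
    Volume scratch = image;  // copy keeps boundary voxels identical
    for (int t = 0; t < steps; t++) {
        diffuseStep(image, scratch, dt);
        std::swap(image.data, scratch.data);
    }
}
```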
  • 67. Anisotropic diffusion – TBB #1
    class doAD {
    public:
      static ADConstants* sConstants;
      doAD ( Image* in, Image* out ) { ... }
      void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
        if ( sConstants == NULL ) { initConstants(); }
        // process
        ...
      }
    };
  • 68. Anisotropic diffusion – TBB #2
    class doAD {
    public:
      doAD ( ... ) {...}
      void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
        for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
          for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
            for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
              ...
      } };
    ...
    parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,
                                                      0, in->mHeight,
                                                      0, in->mWidth ),
                   doAD ( in, out ), auto_partitioner() );
  • 69. Anisotropic diffusion – TBB #3
    class doAD {
    public:
      static tbb::atomic<int> sProgress;
      tbb::spin_mutex mMutex;
      doAD ( ... ) {...}
      void reportProgress ( int p ) { ... }
      void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
        for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
          tbb::spin_mutex::scoped_lock lock ( mMutex );
          sProgress++;
          reportProgress ( sProgress );
          for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
            for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
              ...
      } };
    ...
    doAD::sProgress = 0;
    parallel_for (...);
  • 70. Anisotropic diffusion – TBB #4
    class doAD {
    public:
      static tbb::atomic<int> sProgress;
      static tbb::spin_mutex mMutex;
      doAD ( ... ) {...}
      void reportProgress ( int p ) { ... }
      void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
        for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
          tbb::spin_mutex::scoped_lock lock ( mMutex );
          sProgress++;
          reportProgress ( sProgress );
          for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
            for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
              ...
      } };
    ...
    doAD::sProgress = 0;
    parallel_for (...);
  • 71. Anisotropic diffusion – OpenMP (Progress)
    using namespace std;
    void doAD ( Image* in, Image* out ) {
      int progress = 0;
      for ( int t = 0; t < TotalTime; t++ ) {
        #pragma omp parallel for
        for ( int s = 0; s < in->mDepth; s++ ) {
          #pragma omp atomic
          progress++;
          #pragma omp single
          reportProgress ( progress );
          ...
        }
      }
    }
    nowait
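OpenMP's nesting rules do not allow a `single` construct directly inside a `parallel for` loop body, so one conforming way to report progress from inside the loop is an atomic capture followed by a named critical section. This is only a sketch of that alternative, not the slide's intended answer; `processSlices` is hypothetical and the `reported` vector stands in for `reportProgress`.

```cpp
#include <vector>

// Process 'depth' slices in parallel and record progress after each one.
int processSlices(int depth, std::vector<int>& reported) {
    int progress = 0;
    #pragma omp parallel for
    for (int s = 0; s < depth; s++) {
        // ... process slice s here ...
        int done;
        // Increment the shared counter and read the new value in one
        // indivisible step, so every iteration sees a distinct count.
        #pragma omp atomic capture
        done = ++progress;
        // Reporting touches shared state, so only one thread at a time.
        #pragma omp critical(report)
        {
            reported.push_back(done);  // stand-in for reportProgress(done)
        }
    }
    return progress;
}
```

The reported values arrive in no particular order, but each count from 1 to `depth` appears exactly once.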
  • 72. Real-life problem
    Compute Frangi’s vesselness measure
    Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946-956
    Memory constrained solution
    ITK implementation requires 1.2G for 100M volume
    Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007)
    Possible solutions using
    OpenMP, TBB
  • 73. Vesselness
  • 74. ITK Implementation – computing the Hessian
    6 volumes computed in serial
    Individual filters are threaded
    Good CPU usage
    High memory requirements
  • 75. Design considerations
    Break problem into blocks
    Compute Hessian, eigenvalues, and vesselness
    Reduces memory requirements
    Incurs overhead, boundary conditions
  • 76. Design considerations
    Keep CPUs full
  • 77. Design considerations – boundary condition
  • 78. Trade-offs
  • 79. Algorithm sketch – Serial
    int BlockSize = 32;
    for ( int z = 0; z < in->mDepth; z += BlockSize ) {
      for ( int y = 0; y < in->mHeight; y += BlockSize ) {
        for ( int x = 0; x < in->mWidth; x += BlockSize ) {
          processBlock ( in, out, x, y, z, BlockSize );
        }
      }
    }
  • 80. Algorithm sketch – OpenMP
    int BlockSize = 32;
    #pragma omp parallel for
    for ( int z = 0; z < in->mDepth; z += BlockSize ) {
      for ( int y = 0; y < in->mHeight; y += BlockSize ) {
        for ( int x = 0; x < in->mWidth; x += BlockSize ) {
          processBlock ( in, out, x, y, z, BlockSize );
        }
      }
    }
    Each thread is on a different slice
    May cause cache contention
    Similar problems for “y” direction
  • 81. Algorithm sketch – OpenMP
    int BlockSize = 32;
    for ( int z = 0; z < in->mDepth; z += BlockSize ) {
      for ( int y = 0; y < in->mHeight; y += BlockSize ) {
        #pragma omp parallel for
        for ( int x = 0; x < in->mWidth; x += BlockSize ) {
          processBlock ( in, out, x, y, z, BlockSize );
        }
      }
    }
    All threads on same rows
    May not utilize all CPUs
    If ratio of Width to BlockSize < # CPUs
    Better cache utilization
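One way to keep all CPUs busy regardless of which dimension is short is to enumerate the blocks first and parallelize over the flat list. The sketch below is an assumption-laden illustration: `Block`, `makeBlocks`, and `processAll` are hypothetical, and counting voxels stands in for the real `processBlock` work. Edge blocks are clipped so they never run past the volume.

```cpp
#include <algorithm>
#include <vector>

// Origin and size of one block; edge blocks may be smaller than blockSize.
struct Block {
    int x, y, z;
    int sx, sy, sz;
};

// Enumerate the blocks that tile a width x height x depth volume.
std::vector<Block> makeBlocks(int width, int height, int depth, int blockSize) {
    std::vector<Block> blocks;
    for (int z = 0; z < depth; z += blockSize) {
        for (int y = 0; y < height; y += blockSize) {
            for (int x = 0; x < width; x += blockSize) {
                Block b = {x, y, z,
                           std::min(blockSize, width - x),
                           std::min(blockSize, height - y),
                           std::min(blockSize, depth - z)};
                blocks.push_back(b);
            }
        }
    }
    return blocks;
}

// Process every block in parallel; returns the number of voxels visited.
long processAll(const std::vector<Block>& blocks) {
    long voxels = 0;
    const int n = static_cast<int>(blocks.size());
    // Dynamic scheduling hands out blocks one at a time, which helps when
    // clipped edge blocks are cheaper than full ones.
    #pragma omp parallel for schedule(dynamic) reduction(+ : voxels)
    for (int b = 0; b < n; b++) {
        // Stand-in for processBlock ( in, out, x, y, z, ... )
        voxels += static_cast<long>(blocks[b].sx) * blocks[b].sy * blocks[b].sz;
    }
    return voxels;
}
```

For a 70 x 64 x 33 volume with 32-voxel blocks this yields 12 blocks, more than a single row of x blocks would offer.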
  • 82. Algorithm sketch – TBB
    class Vesselness {
    public:
      void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
        // Process the block, could use ITK here
        processBlock (
          r.cols().begin(), r.rows().begin(), r.pages().begin(),
          r.cols().size(), r.rows().size(), r.pages().size() );
      }
    };
    ...
    parallel_for ( tbb::blocked_range3d<int,int,int>(
                     0, in->mDepth, 32,
                     0, in->mHeight, 32,
                     0, in->mWidth, 32 ),
                   Vesselness( in, out ), auto_partitioner() );
    Individual blocks
    Full CPUs
    May not have best cache performance
  • 83. Next steps
    Go try parallel development
    Try threads to gain understanding and insight
    Next OpenMP, adapting existing code
    TBB: more constructs, different approaches
    Experiment with new languages
    Erlang, Scala, Reia, Chapel, X10, Fortress...
    Check out some of the resources provided
    Have fun! It’s a brave new world out there...
  • 84. Resources
    TBB (http://www.threadingbuildingblocks.org/)
    OpenMP (http://openmp.org/wp/)
    Books/Articles
    Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/)
    Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/)
    ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf)
    The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)
    Tutorials
    Parallel Programming (https://computing.llnl.gov/tutorials/parallel_comp/)
    pthreads (https://computing.llnl.gov/tutorials/pthreads/)
    OpenMP (https://computing.llnl.gov/tutorials/openMP/)
    Other
    LLNL (https://computing.llnl.gov/)
    Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language)
    GCC-OpenMP (http://gcc.gnu.org/projects/gomp/)
    Intel Compiler (http://software.intel.com/en-us/intel-compilers/)
  • 85. Resources
    Languages
    Erlang (http://www.erlang.org/)
    Scala (http://www.scala-lang.org/)
    Chapel (http://chapel.cs.washington.edu/)
    X10 (http://x10-lang.org/)
    Unified Parallel C (http://upc.gwu.edu/)
    Titanium (http://titanium.cs.berkeley.edu/)
    Co-Array Fortran (http://www.co-array.org/)
    ZPL (http://www.cs.washington.edu/research/zpl/home/index.html)
    High Performance Fortran (http://hpff.rice.edu/)
    Fortress (http://projectfortress.sun.com/Projects/Community/)
    Others (http://www.google.com/search?q=parallel+programming+language)
  • 87. Thread construction – pthread example
    #include <pthread.h>
    void *(*start_routine)(void *);
    int pthread_create(pthread_t *restrict thread,
                       const pthread_attr_t *restrict attr,
                       void *(*start_routine)(void *),
                       void *restrict arg);
    void pthread_exit(void *value_ptr);
    int pthread_join(pthread_t thread, void **value_ptr);
  • 88. Mutex – pthread example
    #include <pthread.h>
    pthread_mutex_t myMutex;
    ...
    pthread_mutex_init ( &myMutex, NULL );
    ...
    pthread_mutex_lock ( &myMutex );
    // Critical Section, only one thread at a time
    ...
    pthread_mutex_unlock ( &myMutex );
    ...
    // trylock returns 0 on success, EBUSY if the mutex is already locked
    if ( pthread_mutex_trylock ( &myMutex ) == 0 ) {
      // We did get the lock, so we are in the critical section
      ...
      pthread_mutex_unlock ( &myMutex );
    }
  • 89. Mutex – Java example
    import java.lang.*;
    class Foo {
      public synchronized int doWork () {
        // only one thread at a time can execute doWork on a given Foo
        ...
      }
      Object resource = new Object();
      public int otherWork () {
        synchronized ( resource ) {
          // critical section, resource is the mutex
          ...
        }
      }
    }
  • 90. Threshold – pthread
    struct Work { Image* in; Image* out; int start; int end; };
    Work workArray[THREADCOUNT];
    pthread_t thread[THREADCOUNT];
    void* doThreshold ( void* inWork ) {
      Work* work = (Work*) inWork;
      for ( int s = work->start; s < work->end; s++ ) {...}
      return NULL;
    }
    ...
    pthread_attr_t attributes;
    pthread_attr_init ( &attributes );
    pthread_attr_setdetachstate ( &attributes, PTHREAD_CREATE_JOINABLE );
    for ( int t = 0; t < THREADCOUNT; t++ ) {
      initializeWork ( in, out, t, workArray[t] );
      pthread_create ( &thread[t], &attributes,
                       doThreshold, (void*) &workArray[t] );
    }
    for ( int t = 0; t < THREADCOUNT; t++ ) {
      pthread_join ( thread[t], NULL );
    }
  • 91. Insight Toolkit
  • 92. Semaphore
    Allow N threads access
    Protects limited resources
    Binary semaphore
    N = 1
    Equivalent to Mutex
  • 94. ITK – itk::MultiThreader
    #include <itkMultiThreader.h>
    // Win32
    DWORD doWork ( LPVOID lpThreadParameter );
    // Pthread - Linux, Mac, Unix
    void* doWork ( void* inWork );
    itk::MultiThreader::Pointer threader = itk::MultiThreader::New();
    threader->SetNumberOfThreads ( NumberOfThreads );
    for ( int i = 0; i < NumberOfThreads; i++ ) {
      threader->SetMultipleMethod ( i, doWork, (void*) work[i] );
    }
    // Explicit barrier, waits for Thread join
    threader->MultipleMethodExecute();
  • 95. Insight Toolkit
    #include <itkImageToImageFilter.h>
    template <class In, class Out>
    class Worker : public itk::ImageToImageFilter<In, Out> {
      ...
      void BeforeThreadedGenerateData() {
        // Master thread only
        ...
      }
      void ThreadedGenerateData( const OutputImageRegionType& r, int tid ) {
        // Generate output data for r
        ...
      }
      void AfterThreadedGenerateData() {
        // Master thread only
        ...
      }
    };
    // Output split on last dimension
    // i.e. Slices for 3D volumes
  • 96. Anisotropic diffusion – OpenMP
    using namespace std;
    void doAD ( Image* in, Image* out ) {
    for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int slice = 0; slice < in->mDepth; slice++ ) {
    ...
    }
    }
    }