Beyond The Critical Section
An introduction to parallel programming; from basic problems and parallel primitives to patterns and lock free programming.

Introduction
- Tony Albrecht
- Senior Programmer for Pandemic Studios Brisbane
- Email: Tony.Albrecht0(at)gmail(dot)com
Overview
- Justify myself
- Start at the bottom
- Continue from the top
- Quick look in the middle
Parallel Programming: Why?
- Moore's Law
  - Limits to sequential CPUs – parallel processing is how we avoid those limits.
- Programs must be parallel to get Moore-level speedups.
- Applies to programming in general.
Moore’s Law
"Waaaah!"
- "Parallel programming is hard."
- "My code already runs incredibly fast – it doesn't need to go any faster."
- "It's impossible to parallelise this algorithm."
- "Only the rendering pipeline needs to be parallel."
- "That's only for supercomputers."
Console trends
So?
- ~2011
- ~6TFlop machine
- Next console will have between 64 and 128 processors
- 4 to 8GB of memory
- 128 processors!!!!
How can we utilise 100+ CPUs?
- Start now
  - Design
  - Implement
  - Iterate
  - Learn
The Problems
- Race conditions
Race Condition Example
- Threads A and B both execute x++ on a shared x, initially x=0. What is x afterwards?
- Thread A reads x into a register: R1 = 0.
- Thread A computes R1 = 0+1.
- Thread B reads x before A writes back, so its R1 = 0, while A holds R1 = 1.
- Thread A writes x=1; Thread B computes R1 = 0+1.
- Thread B writes its R1 back: x=1. One increment has been lost.
- Solution requires atomics or locking (see the sketch below).
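Not from the slides, but to make the fix concrete: a minimal C++11 sketch of the same two-thread increment. With a plain int, increments are lost exactly as above; making the counter std::atomic<int> turns each increment into a single uninterruptable operation.

#include <atomic>
#include <cstdio>
#include <thread>

static std::atomic<int> x(0);   // with a plain int, increments can be lost

static void Increment()
{
    for (int i = 0; i < 100000; ++i)
        x.fetch_add(1);         // atomic read-modify-write: no lost updates
}

int main()
{
    std::thread a(Increment);
    std::thread b(Increment);
    a.join();
    b.join();
    std::printf("x = %d\n", x.load());   // always 200000 with the atomic
    return 0;
}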
Atomics
- Atomic operations are uninterruptable, singular operations:
  - Get/Set
  - Inc/Dec (Add/Sub)
  - Compare And Swap
  - Plus other variations
Compare And Swap
- CAS(memory, oldValue, newValue)
  - if (memory == oldValue) memory = newValue;
- Surprisingly useful.
- Simple locking primitive (sketched below):
  - while (CAS(&lock, 0, 1) != 0) ;
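A sketch of how this CAS might be expressed on C++11 atomics. The slide's CAS returns the value that was actually in memory, so the spinlock loop can tell whether it won the swap; this is an illustration, not the deck's code.

#include <atomic>

// Returns the value that was in memory, like the slide's CAS.
int CAS(std::atomic<int>* memory, int oldValue, int newValue)
{
    // On failure, compare_exchange_strong writes the observed value back
    // into oldValue, which is exactly the "return what was there" behaviour
    // the spinlock loop relies on.
    memory->compare_exchange_strong(oldValue, newValue);
    return oldValue;
}

std::atomic<int> lock(0);

void LockWithCAS()
{
    while (CAS(&lock, 0, 1) != 0)
        ;                 // spin until we are the thread that swapped 0 -> 1
}

void UnlockWithCAS()
{
    lock.store(0);
}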
Race Condition Solution A
- Threads A and B both call AtomicInc(x), starting from x=0.
- Whatever the interleaving, x goes 0, 1, 2: both increments survive.
Locking
- Used to serialise access to code.
  - Like a key to a coffee shop toilet:
    - one key,
    - one toilet,
    - queue for access.
- Lock()/Unlock()

  ...code...
  Lock();
  // protected region
  Unlock();
  ...more code...
Race Condition Solution B
- Threads A and B each execute: Lock A; x++; Unlock A.
- Starting from x=0, one thread acquires lock A first and increments (x=1); the other blocks on the lock, then increments in turn (x=2).
The Problems
- Race conditions
- Deadlocks
Deadlock
- "When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone." — Kansas Legislature
- Deadlock can occur when 2 or more processes require resource(s) from another.
Deadlock

    Thread 1    Thread 2
    Lock A      Lock B
    Lock B      Lock A
    Unlock A    Unlock B

- Generally can be considered to be a logic error.
- Can be painfully subtle and rare.
The Problems
- Race conditions
- Deadlocks
- Read/write tearing
Read/write tearing
- More than one thread writing to the same memory at the same time.
- The more data, the more likely.
- Solve with synchronisation primitives.
- e.g. "AAAAAAAA" and "BBBBBBBB" written concurrently can leave "AAAABBBB".
The Problems
- Race conditions
- Deadlocks
- Read/write tearing
- Priority Inversion
Priority Inversion
- Consider threads with different priorities.
- A low priority thread holds a shared resource.
- A high priority thread tries to acquire that resource.
- The high priority thread is blocked by the low.
- Medium priority threads will execute at the expense of both the low and the high threads.
The Problems
- Race conditions
- Deadlocks
- Read/write tearing
- Priority Inversion
- The ABA Problem
The ABA problem
- Thread 1 reads 'A' from memory.
- Thread 2 modifies memory value 'A' to 'B' and back to 'A' again.
- Thread 1 resumes and assumes nothing has changed (using CAS).
- Often associated with dynamically allocated memory.
ABA example: consider a list (head -> a -> b -> c -> ...) and a thread pool.
- Thread A is about to CAS head's next pointer from a to b: CAS(&head->next, a, b);
- Thread B dequeues a and b; both nodes are released into thread-local pools.
- Thread B then enqueues a again: the node is reused and sits at the front once more.
- Thread A resumes and its CAS(&head->next, a, b) succeeds, since head->next is 'a' again; but it now points head at 'b', a node that is no longer in the list.
ABA Solution
- Tag each pointer with a count.
- Each time you use the ptr, inc the tag.
- Must do it atomically (a sketch follows).
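A sketch of the tagged-pointer idea, assuming a platform where a double-width CAS is available (std::atomic over a pointer-plus-counter struct; whether this is actually lock-free depends on the hardware). The names here are illustrative, not the deck's.

#include <atomic>
#include <cstdint>

struct Node;   // list node, details elided

struct TaggedPtr
{
    Node*     ptr;
    uintptr_t tag;   // bumped on every reuse, so A-B-A becomes A-B-A'
};

std::atomic<TaggedPtr> head;   // 16 bytes on 64-bit: needs a double-width CAS to be lock-free

// Swing head from expectedPtr to newPtr. If another thread removed and
// re-inserted the very same node in between, its tag has changed and the
// CAS fails instead of silently succeeding.
bool AdvanceHead(Node* expectedPtr, Node* newPtr)
{
    TaggedPtr expected = head.load();
    if (expected.ptr != expectedPtr)
        return false;
    TaggedPtr desired = { newPtr, expected.tag + 1 };
    return head.compare_exchange_strong(expected, desired);
}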
The Problems
- Race conditions
- Deadlocks
- Read/write tearing
- Priority Inversion
- The ABA Problem
- Thread scheduling problems
Convoy/Stampede
- Convoy: multiple threads restricted by a bottleneck.
- Stampede: multiple threads being started at once.
Higher Level Locking Primitives
- SpinLock
- Mutex
- Barrier
- RWLock
- Semaphore
SpinLock
- Loop until a value is set.
- No OS overhead with thread management:
  - Doesn't sleep the thread.
  - Handy if you will never wait for long.
  - Very bad if you need to wait for a long time.
- Can embed Sleep() or Yield()
  - But these can be perilous.
- (A sketch on C++11 atomics follows.)
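A minimal spinlock sketch along these lines, using C++11's std::atomic_flag. This is not the slides' implementation; the yield() line is the "embed a Yield()" variant, with the caveat above.

#include <atomic>
#include <thread>

class SpinLock
{
public:
    void Lock()
    {
        // test_and_set returns the previous value: loop until we are
        // the thread that flipped the flag from clear to set.
        while (m_Flag.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();   // optional; a plain spin also works
    }
    void Unlock()
    {
        m_Flag.clear(std::memory_order_release);
    }
private:
    std::atomic_flag m_Flag = ATOMIC_FLAG_INIT;
};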
Mutex
- Mutual Exclusion.
- A simple lock/unlock primitive:
  - Otherwise known as a CriticalSection.
- Used to serialise access to code.
- Often overused.
- More than just a spinlock:
  - can release the thread.
- Be aware of overhead.
Barrier
- Will block until 'n' threads signal it.
- Useful for ensuring that all threads have finished a particular task.
Barrier example
- Thread 1 blocks on Barrier(3) before it can use the results; Threads 2-4 each do their own stuff.
- As each worker finishes calculating, it signals the barrier and the count drops: 3, 2, 1, 0.
- Workers are free to move on to other work (or calc pi) once they have signalled.
- When the count reaches 0, Thread 1 unblocks and uses the results.
- (A condition-variable sketch of such a barrier follows.)
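A sketch of a counting barrier with the behaviour shown in the example, built on a mutex and condition variable. The deck does not show an implementation, so this is one assumed way to build it.

#include <condition_variable>
#include <mutex>

class Barrier
{
public:
    explicit Barrier(int count) : m_Count(count) {}

    // Workers call Signal() when their results are ready.
    void Signal()
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        if (--m_Count == 0)
            m_Cond.notify_all();
    }

    // The consuming thread blocks here until 'n' workers have signalled.
    void Wait()
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_Cond.wait(lock, [this] { return m_Count == 0; });
    }

private:
    std::mutex              m_Mutex;
    std::condition_variable m_Cond;
    int                     m_Count;
};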
RWLock
- Allows many readers.
- But exclusive writing:
  - Writing blocks writers and readers.
  - Writing waits until all readers have finished.
Semaphore
- Generalisation of mutex.
- Allows 'c' threads access to critical code at once.
- Basically an atomic integer:
  - Wait() will block if value == 0; then dec & continue.
  - Signal() increments value (allows a waiting thread to unblock).
- Conceptually:
  - Mutexes stop other threads from running code.
  - Semaphores tell other threads to run code.
- (Sketched below.)
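And the semaphore itself, sketched the same way: an integer guarded by a mutex, with Wait() and Signal() behaving as described above. An assumed implementation for illustration, not the deck's.

#include <condition_variable>
#include <mutex>

class Semaphore
{
public:
    explicit Semaphore(int initial) : m_Value(initial) {}

    // Blocks while the value is 0, then decrements and continues.
    void Wait()
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_Cond.wait(lock, [this] { return m_Value > 0; });
        --m_Value;
    }

    // Increments the value and allows a waiting thread to unblock.
    void Signal()
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        ++m_Value;
        m_Cond.notify_one();
    }

private:
    std::mutex              m_Mutex;
    std::condition_variable m_Cond;
    int                     m_Value;
};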
Parallel Patterns
- Why patterns?
- A set of templates to aid design.
- A common language.
- Aids education.
- Provides a familiar base to start implementation.
So, how do we start?
- Analyse your problem.
- Identify tasks that can execute concurrently.
- Identify data local to each task.
- Identify task order/schedule.
- Analyse dependencies between tasks.
- Consider the HW you are running on.
Problem Decomposition (from "Patterns for Parallel Programming")

  Problem
  +- Organise By Tasks
  |    +- Linear    -> Task Parallelism
  |    +- Recursive -> Divide and Conquer
  +- Organise By Data Decomposition
  |    +- Linear    -> Geometric Decomposition
  |    +- Recursive -> Recursive Data
  +- Organise By Data Flow
       +- Linear    -> Pipeline
       +- Recursive -> Event-Based Coordination
Task Parallelism
- Task dominant, linear.
- Functionally driven problem.
- Many tasks that may depend on each other:
  - Try to minimise dependencies.
- Key elements:
  - Tasks
  - Dependencies
  - Schedule
Divide and Conquer
- Task dominant, recursive.
- Problem solved by splitting it into smaller sub-problems and solving them independently.
- Generally it's easy to take a sequential Divide and Conquer implementation and parallelise it.
Geometric Decomposition
- Data dominant, linear.
- Decompose the data into chunks.
- Solve for chunks independently.
  - Beware of edge dependencies.
Recursive Data Pattern
- Data dominant, recursive.
- Operations on trees, lists, graphs:
  - Dependencies can often prohibit parallelism.
- Often requires tricky recasting of the problem:
  - i.e. operate on all tree elements in parallel.
  - More work, but distributed across more cores.
Pipeline Pattern
- Data flow dominant, linear.
- Sets of data flowing through a sequence of stages.
- Each stage is independent.
- Easy to understand - simple, dedicated code.
Event-Based Coordination
- Data flow dominant, recursive.
- Groups of semi-independent tasks interacting in an irregular fashion.
- Tasks send events to other tasks, which send events in turn...
- Can be highly complex.
- Tricky to load balance.
Supporting Structures
- Program Structures: SPMD, Master/Worker, Loop Parallelism, Fork/Join
- Data Structures: Shared Data, Distributed Array, Shared Queue

Program Structures (SPMD, Master/Worker, Loop Parallelism, Fork/Join)
SPMD
- Single Program, Multiple Data.
- A single source code image running on multiple threads.
- Very common.
- Easy to maintain.
- Easy to understand.
Master/Worker
- The dominant force is the need to dynamically load balance:
  - Tasks are highly variable, i.e. in duration/cost.
  - Program structure doesn't map onto loops.
  - Cores vary in performance.
- "Bag of Tasks":
  - The Master sets up tasks and waits for completion.
  - Workers grab a task from the queue, execute it and then grab the next one.
Loop Parallelism
- Dominated by computationally expensive loops.
- Split iterations of the loop out to threads.
- Be careful of memory use and process granularity.
- (A sketch follows.)
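A sketch of the technique, assuming the iterations are independent; Process() is a hypothetical stand-in for the real per-element work.

#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-element work.
static void Process(float& element) { element *= 2.0f; }

void ParallelFor(std::vector<float>& data, unsigned threadCount)
{
    std::vector<std::thread> threads;
    const std::size_t chunk = (data.size() + threadCount - 1) / threadCount;

    for (unsigned t = 0; t < threadCount; ++t)
    {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end   = std::min(begin + chunk, data.size());
        threads.emplace_back([&data, begin, end]
        {
            // Coarse, contiguous chunks keep threads off each other's cache lines.
            for (std::size_t i = begin; i < end; ++i)
                Process(data[i]);
        });
    }
    for (std::thread& th : threads)
        th.join();
}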
Fork/Join
- The number of concurrent tasks varies over the life of the execution.
- Complex or recursive relations between tasks.
- Either:
  - direct task/core mapping, or
  - a thread pool.
Supporting Data Structures (Shared Data, Distributed Array, Shared Queue)
Shared Data
- Required when:
  - At least one data structure is accessed by multiple tasks.
  - At least one task modifies the shared data.
  - The tasks potentially need to use the modified value.
- Solutions:
  - Serialise execution - mutual exclusion.
  - Noninterfering sets of operations.
  - RWLocks.
Distributed Array
- How can we distribute an array across many threads?
  - Used in Geometric Decomposition.
- Break the array into thread-specific parts.
- Maximise locality per thread.
- Be wary of cache line overlap:
  - Keep data distribution coarse.
Shared Queue
- Extremely valuable construct.
- Fundamental part of Master/Worker ("Bag of Tasks").
- Must be consistent and work with many competing threads.
- Must be as efficient as possible:
  - Preferably lock free.
- (A simple locked baseline is sketched below.)
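A mutex-guarded baseline sketch of such a queue. The deck ultimately argues for a lock-free version; this is the simple variant that workers in a Master/Worker setup would poll.

#include <mutex>
#include <queue>

template <typename T>
class SharedQueue
{
public:
    void Push(const T& item)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Queue.push(item);
    }

    // Returns false instead of blocking when the queue is empty,
    // so an idle worker can go and do something else.
    bool TryPop(T& out)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        if (m_Queue.empty())
            return false;
        out = m_Queue.front();
        m_Queue.pop();
        return true;
    }

private:
    std::mutex    m_Mutex;
    std::queue<T> m_Queue;
};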
Lock free programming
- Locks:
  - Simple, easy to use and implement.
  - But they serialise code execution.
- Lock Free:
  - Tricky to implement and debug.
Lock Free linked list
- A lock free linked list (ordered).
- Easily generalised to other container classes:
  - Stacks
  - Queues
- Relatively simple to understand.
Adding a node to a list (single threaded): insert b between a and c.
- Step 1: Find where to insert.
- Step 2: newNode->Next = prev->Next;
- Step 3: prev->Next = newNode;
- (Sketched in full below.)
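The three steps as single-threaded C++, with the node type simplified from the deck's Node class.

struct Node
{
    int   m_Key;
    Node* m_Next;
};

// Insert into an ordered list starting at a dummy head node.
void Add(Node* head, Node* newNode)
{
    // Step 1: find where to insert (prev is the node before the slot).
    Node* prev = head;
    while (prev->m_Next && prev->m_Next->m_Key < newNode->m_Key)
        prev = prev->m_Next;

    // Step 2: point the new node at its successor.
    newNode->m_Next = prev->m_Next;

    // Step 3: splice it in.
    prev->m_Next = newNode;
}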
Extending to multiple threads
- What could go wrong?
Add 'b' and 'c' concurrently (both between a and d):
- Both threads find the same insertion point after a.
- Both set newNode->Next = prev->Next;
- Both set prev->Next = newNode; the second write overwrites the first, and one of the new nodes is silently lost.
Extending to multiple threads
- What could go wrong?
  - Another node could be added between a & c.
  - a or c could be deleted.
  - A concurrent read could reach a dangling pointer.
  - Any number of multiples of the above.
- If anything can go wrong, it will.
- So, how do we make it thread safe?
  - Let's examine some solutions.
Coarse Grained Locking
- Lock the list for each add or remove.
- Also lock for reads (find, iterators).
- Will effectively serialise the list:
  - Only one thread at a time can access the list.
  - Correctness at the expense of performance.
A concrete example
- 10 producers:
  - Add 500 random numbers in a tight loop.
- 10 consumers:
  - Remove the 500 numbers in a tight loop.
- Each in its own thread:
  - 21 threads.
- Running on PS3 using SNTuner to profile.
Coarse Grain: adding b
- Step 1: Lock the whole list.
- Steps 2 & 3: Find the insertion point, then insert.
- Step 4: Unlock.
- (Sketched below.)
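A sketch of the coarse-grained variant: a single mutex around the same find-and-insert, reusing the Node struct from the earlier sketch.

#include <mutex>

// Node as in the previous sketch: { int m_Key; Node* m_Next; }
class CoarseList
{
public:
    void Add(Node* newNode)
    {
        std::lock_guard<std::mutex> lock(m_Lock);   // Step 1: lock the list
        // Steps 2 & 3: the single-threaded code is safe under the lock.
        Node* prev = &m_Head;
        while (prev->m_Next && prev->m_Next->m_Key < newNode->m_Key)
            prev = prev->m_Next;
        newNode->m_Next = prev->m_Next;
        prev->m_Next = newNode;
    }   // Step 4: unlock (lock_guard releases on scope exit)

private:
    std::mutex m_Lock;
    Node       m_Head { 0, nullptr };
};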
Coarse Grained locking
- Wide green bars are active locks; little blips are adds or removes.
- Execution took 416ms (profiling significantly impacts performance).
Fine Grained Locking
- Add and Remove only affect their neighbours.
- Give each Node a lock:
  - (So creating a node creates a mutex.)
  - Lock only the neighbours when adding or removing.
- When iterating along the list you must lock/unlock as you go.
Fine Grained Locking: adding b
- Traverse hand-over-hand: lock head, lock a, then release head.
- With a still held, lock its neighbour c: the insertion point is now pinned.
- Link b in between a and c, then unlock a and c.
Fine Grained Locking
- Blocking is much longer, due to the overhead of creating a mutex per node.
- Very slow: > 1200ms.
- A better solution would have been a pool of reusable mutexes.
Optimistic Locking
- Search without locking.
- Lock the nodes once found, then validate them:
  - Valid if you can navigate to them from head.
- If invalid, search from head again.
Optimistic: Add("g") into head -> a -> c -> d -> ... -> f -> k -> m -> tail
- Step 1: Search from head, without locking, for the insertion window (f, k).
- Step 2: Lock f and k.
- Step 3: Validate by re-walking from head: f must still be reachable and f->next must still be k.
- Step 3 FAIL: the list changed underneath us (a concurrent remove); unlock and retry.
- Step 3a: Search and validate again; this time validation succeeds.
- Step 4: Add g between f and k.
- Step 5: Unlock.
Optimistic Caveat
- We can't delete nodes immediately:
  - Another thread could be reading them.
  - Can't rely on the memory not being changed.
- Use deferred garbage collection:
  - Delete in a 'safe' part of a frame.
- Or use invasive lists (supply your own nodes).
- Find() requires validation (locks).
Delete Caveat
- A validating thread re-walks the list from head while another thread deletes 'd'.
- Because 'd' is unlinked but not freed, a walker standing on 'd' can still follow its next pointer back into the list, and validation completes safely.
- Free 'd' immediately and that same walk would dereference dangling memory.
Optimistic Synchronisation
- ~540ms.
- Most time was spent validating.
- Plus there was the overhead of creating a mutex per node for the lock.
- Again, a pool of mutexes would help.
Lazy Synchronisation
- An attempt to speed up Optimistic validation.
- Store a deleted flag in each node.
- Find() is then lock free:
  - Just check the deleted flag on success.
Lazy: Add("g")
- Step 1: Search without locking (head, a, c, ...).
- Step 1a: Meanwhile another thread deletes c: it searches for c,
- Step 1b: locks c and its predecessor,
- Step 1c: sets c's deleted mark, then unlinks it and unlocks.
- Step 2: The adding thread locks f and k, skipping (and unlocking) past marked nodes.
- Step 3: Add/Validate: check that f and k are unmarked and that f->next == k (no re-walk from head), then link g in.
- Step 4: Unlock.
Lazy Synchronisation
- Still need to keep the deleted nodes around.
- Faster than Optimistic: ~330ms.
- But it still serialises.
Lock free (Non-Blocking)
- Can't we just modify Lazy Synchronisation to use CAS?
Delete 'a' and add 'b' concurrently:
- Thread 1 deletes 'a': prev->next = curr->next, i.e. head->next = a->next (which it read as c).
- Thread 2 adds 'b' after a: prev->next = b, i.e. a->next = b.
- Thread 1's write lands, and head now points straight at c. This effectively deletes 'a' and 'b'.
Introducing the AtomicMarkedPtr<>
- A wrapper on a uint32.
- Encapsulates an atomic pointer and a flag.
- Allows testing of a flag and updating of a pointer atomically.
- Uses the LSB for the flag.

  AtomicMarkedPtr<Node> next;
  next->CompareAndSet(eValue, nValue, eFlag, nFlag);
AtomicMarkedPtr<>
- We can now use CAS to set a pointer and check a flag in a single atomic action (one possible implementation is sketched below):
  - i.e. check the deleted status and change the pointer at the same time.

  class Node
  {
  public:
      Node();
      AtomicMarkedPtr<Node> m_Next;
      T                     m_Data;
      int32                 m_Key;
  };
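One plausible implementation, packing the mark into the pointer's least significant bit (valid since nodes are at least 2-byte aligned). The deck wraps a uint32; this sketch uses uintptr_t and C++11 atomics, and the details beyond the CompareAndSet interface are assumptions.

#include <atomic>
#include <cstdint>

template <typename T>
class AtomicMarkedPtr
{
public:
    AtomicMarkedPtr() : m_Bits(0) {}

    T*   GetPtr()  const { return reinterpret_cast<T*>(m_Bits.load() & ~uintptr_t(1)); }
    bool GetMark() const { return (m_Bits.load() & 1) != 0; }

    void Set(T* ptr, bool mark) { m_Bits.store(Pack(ptr, mark)); }

    // Succeeds only if both the pointer AND the mark match expectations.
    bool CompareAndSet(T* ePtr, T* nPtr, bool eMark, bool nMark)
    {
        uintptr_t expected = Pack(ePtr, eMark);
        return m_Bits.compare_exchange_strong(expected, Pack(nPtr, nMark));
    }

private:
    static uintptr_t Pack(T* ptr, bool mark)
    {
        return reinterpret_cast<uintptr_t>(ptr) | uintptr_t(mark ? 1 : 0);
    }
    std::atomic<uintptr_t> m_Bits;   // pointer with the mark in the LSB
};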
Lock Free: Remove 'd' from head -> a -> c -> d -> f -> k -> m -> tail
- Start loop:
- Step 1: Find 'd': if (!InternalFind('d')) continue;  (leaves pred = c, curr = d, succ = f)
- Step 2: Mark 'd' as deleted: if (!curr->next->CAS(succ, succ, false, true)) continue;
- Step 3: Skip 'd': pred->next->CAS(curr, succ, false, false);
- (The whole loop is sketched below.)
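Putting the steps together, the remove loop might look like this. This is a reconstruction using the AtomicMarkedPtr sketch above and the deck's Node class, with InternalFind() assumed to report pred/curr/succ; it is not the deck's exact code.

// Assumed helper: walks the list, skipping (and unlinking) marked nodes,
// and reports the window around 'key'. Declared only; body not shown here.
bool InternalFind(int key, Node*& pred, Node*& curr, Node*& succ);

bool Remove(int key)
{
    for (;;)                                        // "Start loop:"
    {
        Node* pred; Node* curr; Node* succ;
        if (!InternalFind(key, pred, curr, succ))   // Step 1: find the node
            return false;                           // not in the list

        // Step 2: logical delete - mark curr's next pointer. If the CAS
        // fails, someone changed curr->next under us: retry from the top.
        if (!curr->m_Next.CompareAndSet(succ, succ, false, true))
            continue;

        // Step 3: physical unlink. Failure is harmless: the next
        // InternalFind() (ours or another thread's) will skip 'curr'.
        pred->m_Next.CompareAndSet(curr, succ, false, false);
        return true;
    }
}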
LockFree: InternalFind()
- Finds pred and curr.
- Skips marked nodes.
- Consider the list at Step 2 in the previous example,
- and let's introduce a second thread calling InternalFind().
Second InternalFind()
- The second thread walks the list while 'd' is marked but not yet unlinked.
- When it finds that its succ ('d') is marked, it CASes pred->next past the marked node, physically removing it, and carries on.
Lock Free Synchronisation
- No blocking at all.
- The list is always in a consistent state.
- Faster threads help out slower ones.
Lock free
- Full thread usage.
- ~60ms.
- High thread coherency.
Performance comparison (times from the runs above):

  Coarse Grained   ~416ms
  Fine Grained     >1200ms
  Optimistic       ~540ms
  Lazy             ~330ms
  Lock Free        ~60ms
Real world considerations
- Cost of locking
- Context switching
- Memory coherency/latency
- Size/granularity of tasks
Advice
- Build a set of lock free containers.
- Design around data flow.
- Minimise locking.
- You can have more than 'n' threads on an 'n' core machine.
- Profile, profile, profile.
References
- Patterns for Parallel Programming - T. Mattson et al.
- The Art of Multiprocessor Programming - M. Herlihy and N. Shavit
- http://www.top500.org/
- Flow Based Programming - http://www.jpaulmorrison.com/fbp/index.shtml
- http://www.valvesoftware.com/publications/2007/GDC2007_SourceMulticore.pdf
- http://www.netrino.com/node/202
- http://blogs.intel.com/research/2007/08/what_makes_parallel_programmin.php
- The Little Book of Semaphores - http://www.greenteapress.com/semaphores/
- My Blog: 7DOF - http://seven-degrees-of-freedom.blogspot.com/