Solution Patterns for Parallel Programming
1. Solution Patterns for Parallel Programming
CS4532 Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Dr. Srinath Perera
3. Building a Solution by Composition
We often solve problems by reducing them to a composition of known problems
Finding the way to Habarana?
Sorting 1 million integers
Can we solve this with Mutexes & Semaphores?
Mutex for mutual exclusion
Semaphores for signaling
There is another level above these primitives
4. Designing Parallel Algorithms
Parallel algorithm design is not easily reduced to simple recipes
A parallel version of a serial algorithm is not necessarily optimal
Good algorithms require creativity
Goal
Suggest a framework within which parallel algorithm design can be explored
Develop intuition as to what constitutes a good parallel algorithm
5. Methodical Design
Partitioning & communication focus on concurrency & scalability
Agglomeration & mapping focus on locality & other performance issues
Source: www.drdobbs.com/parallel/designing-parallel-algorithms-part-1/223100878
6. Methodical Design (Cont.)
1. Partitioning
Decompose computation/data into small tasks/chunks
Focus on recognizing opportunities for parallel execution
Practical issues such as the number of CPUs are ignored
2. Communication
Determine the communication required to coordinate task execution
Define communication structures & algorithms
7. Methodical Design (Cont.)
3. Agglomeration
Defined task & communication structures are evaluated with respect to
Performance requirements
Implementation costs
If necessary, tasks are combined into larger tasks to improve performance & reduce development costs
Source: www.drdobbs.com/architecture-and-design/designing-parallel-algorithms-part-3/223500075
8. Methodical Design (Cont.)
4. Mapping
Each task is assigned to a processor while attempting to satisfy the competing goals of
Maximizing processor utilization
Minimizing communication costs
Static mapping
At design/compile time
Dynamic mapping
At runtime by load-balancing algorithms
9. Parallel Algorithm Design Issues
Efficiency
Scalability
Partitioning computations
Domain decomposition – based on data
Functional decomposition – based on computation
Locality
Spatial & temporal
Synchronous & asynchronous communication
Agglomeration to reduce communication
Load-balancing
10. 3 Ways to Parallelize
1. By Data
Partition the data & give each partition to a different thread
2. By Task
Partition the task into smaller tasks & give them to different threads
3. By Order
Partition the task into steps & give them to different threads
11. By Data
Use the SPMD model
When data can be processed locally with few dependencies on other data
Patterns
Loop parallel, embarrassingly parallel
Large data units – underutilization
Small data units – thrashing
Chunk layout
Based on dependencies & caching
Example – Processing geographical data
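As a concrete illustration (not from the original slides), here is a minimal Java sketch of by-data parallelism: the array is split into chunks and each thread processes only its own chunk. The data and the doubling operation are arbitrary placeholders.

public class ByDataExample {
    public static void main(String[] args) throws InterruptedException {
        int[] data = new int[1_000_000];            // placeholder data
        int nThreads = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[nThreads];
        int chunk = (data.length + nThreads - 1) / nThreads;

        for (int t = 0; t < nThreads; t++) {
            final int start = t * chunk;
            final int end = Math.min(start + chunk, data.length);
            workers[t] = new Thread(() -> {
                // Each thread touches only its own chunk: no sharing, no locks
                for (int i = start; i < end; i++) {
                    data[i] = data[i] * 2;
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();          // wait for all chunks
    }
}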
12. By Task
Task Parallel, Divide & Conquer
Too many tasks – thrashing
Too few tasks – underutilization
Dependencies among tasks
Removable
Code transformations
Separable
Accumulation operations (average, sum, count) – see the sketch below
Extrema (max, min)
Read only, Read/Write
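A minimal Java sketch of a separable accumulation (the sum case from the list above); the task count and data are arbitrary choices. Each task accumulates into its own private partial sum, and the partial results are combined only at the end:

import java.util.*;
import java.util.concurrent.*;

public class ByTaskSum {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);                        // placeholder data
        int nTasks = 4;
        int chunk = data.length / nTasks;

        ExecutorService pool = Executors.newFixedThreadPool(nTasks);
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int t = 0; t < nTasks; t++) {
            final int start = t * chunk;
            final int end = (t == nTasks - 1) ? data.length : start + chunk;
            // Each task sums its own range: the dependency is separable
            tasks.add(() -> {
                long s = 0;
                for (int i = start; i < end; i++) s += data[i];
                return s;
            });
        }
        long total = 0;
        for (Future<Long> f : pool.invokeAll(tasks)) total += f.get();
        pool.shutdown();
        System.out.println(total);                   // 1000000
    }
}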
13. By Order
Pipeline & Asynchronous Agents
Dependencies
Temporal – before/after
Same time
None
14. Load Balancing
Some threads will be busy while others are idle
Counter this by distributing the load equally
When the cost of the problem is well understood this is possible
e.g., matrix multiplication, known tree walk
Some other problems are not that simple
When it's hard to predict how the workload will be distributed, use dynamic load balancing
But this requires communication between threads/tasks
2 methods for dynamic load balancing
Task queues
Work stealing
15. Task Queues
A task queue shared by multiple threads (producer-consumer)
Threads come to the task queue after finishing a task & grab the next one
Typically run with a thread pool with a fixed number of threads
Source: http://blog.zenika.com
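A minimal Java sketch of this pattern: Executors.newFixedThreadPool creates a fixed number of threads that pull tasks from a shared internal queue, which is one common realization of a task queue (the task bodies here are placeholders):

import java.util.concurrent.*;

public class TaskQueueExample {
    public static void main(String[] args) throws InterruptedException {
        // 4 threads pull from the pool's shared task queue
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 20; i++) {
            final int id = i;
            pool.submit(() -> System.out.println(
                    Thread.currentThread().getName() + " ran task " + id));
        }
        pool.shutdown();                             // no new tasks accepted
        pool.awaitTermination(1, TimeUnit.MINUTES);  // wait for queued tasks
    }
}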
16. Work Stealing
Every thread has its own work/task queue
When a thread runs out of work, it goes to another thread's queue & "steals" work
Source: http://karlsenchoi.blogspot.com
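In Java, ForkJoinPool (also available via Executors.newWorkStealingPool) implements this pattern: each worker thread owns a deque of tasks, and idle workers steal from the deques of busy ones. A minimal sketch with placeholder tasks:

import java.util.concurrent.*;

public class WorkStealingExample {
    public static void main(String[] args) throws InterruptedException {
        // Backed by a ForkJoinPool: each worker has its own deque,
        // and idle workers steal tasks from other workers' deques
        ExecutorService pool = Executors.newWorkStealingPool();
        for (int i = 0; i < 20; i++) {
            final int id = i;
            pool.submit(() -> System.out.println(
                    Thread.currentThread().getName() + " ran task " + id));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}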
17. Efficiency = Maximizing Parallelism?
Usually it means 2 things
Run the algorithm on the maximum number of threads with minimal communication/waiting
When the size of the problem grows, the algorithm can handle it by adding new resources
It's done by the right architecture + tuning
There is no clear recipe for doing it
Just like "design patterns" for OOP, people have identified parallel programming patterns
18. Solution Patterns for Parallelism
Loop Parallel
Fork/Join
Divide & Conquer
Producer Consumer
Pipeline
Asynchronous Agents
19. Loop Parallel
If each iteration in a loop depends only on that iteration's results + read-only data, each iteration can run in a different thread
As it's based on data, this is also called data parallelism
int[] A = ..; int[] B = ..; int[] C = ..;
for (int i = 0; i < N; i++) {
    C[i] = F(A[i], B[i]);
}
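One way to realize this in Java (an illustrative sketch, with an arbitrary stand-in for F) is a parallel stream over the index range; since the iterations are independent, the runtime may split the range across threads:

import java.util.stream.IntStream;

public class LoopParallelExample {
    static int F(int a, int b) { return a + b; }     // arbitrary stand-in for F

    public static void main(String[] args) {
        int N = 1_000_000;
        int[] A = new int[N], B = new int[N], C = new int[N];
        // Each index is independent, so the range can be split across threads
        IntStream.range(0, N).parallel().forEach(i -> C[i] = F(A[i], B[i]));
    }
}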
20. Which of These are Loop Parallel?
// B is read-only here, so reading B[i-1] creates no loop-carried
// dependency: this loop IS loop parallel.
int[] A = ..; int[] B = ..; int[] C = ..;
for (int i = 1; i < N; i++) {
    C[i] = F(A[i], B[i-1]);
}
// C[i] depends on C[i-1] written by the previous iteration (a
// loop-carried dependency): this loop is NOT loop parallel.
int[] A = ..; int[] B = ..; int[] C = ..;
for (int i = 1; i < N; i++) {
    C[i] = F(A[i], C[i-1]);
}
22. Fork/Join
Fork a job into smaller tasks (independent if possible), perform them, & join them
Examples
Calculate the mean across an array
Tree walk
How to partition?
By Data, e.g., SPMD
By Task, e.g., MPSD
Source: http://en.wikipedia.org/wiki/Fork%E2%80%93join_model
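A minimal Java sketch of the "mean across an array" example using the fork/join framework; the threshold value is an arbitrary choice. Large ranges are forked into halves, small ranges are summed directly, and the partial sums are joined:

import java.util.Arrays;
import java.util.concurrent.*;

public class MeanTask extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000;             // arbitrary cut-off
    final int[] data; final int lo, hi;

    MeanTask(int[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override protected Long compute() {
        if (hi - lo <= THRESHOLD) {                  // small enough: sum directly
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) / 2;                     // fork into two halves
        MeanTask left = new MeanTask(data, lo, mid);
        left.fork();                                 // left half runs asynchronously
        long right = new MeanTask(data, mid, hi).compute();
        return right + left.join();                  // join the forked half
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 3);
        long sum = ForkJoinPool.commonPool().invoke(new MeanTask(data, 0, data.length));
        System.out.println("mean = " + (double) sum / data.length); // 3.0
    }
}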
23. Fork/Join (Cont.)
Size of work unit
Small units – thrashing
Big units – imbalance
Balancing load among threads
Static allocation
If data/task is completely known
e.g., matrix addition
Dynamic allocation (e.g., tree walks)
Task queues
Work stealing
25. Divide & Conquer
Break the problem into recursive sub-problems & assign them to different threads
Examples
Quick sort
Search for a value in a tree
Calculating the Fibonacci sequence
Often forks again, leading to an execution tree
Recursion
May or may not have a join step
Deep tree – thrashing
Shallow tree – underutilization
26. Divide & Conquer – Fibonacci Sequence
Source: Introduction to Algorithms (3rd Edition) by Cormen, Leiserson, Rivest and Stein
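A sketch of the recursive Fibonacci computation from Cormen et al., expressed here with Java's fork/join framework (algorithmically inefficient, but it shows how each call forks again into an execution tree):

import java.util.concurrent.*;

public class FibTask extends RecursiveTask<Long> {
    final int n;
    FibTask(int n) { this.n = n; }

    @Override protected Long compute() {
        if (n < 2) return (long) n;                  // base case: leaf of the tree
        FibTask f1 = new FibTask(n - 1);
        f1.fork();                                   // spawn FIB(n-1)
        FibTask f2 = new FibTask(n - 2);
        return f2.compute() + f1.join();             // compute FIB(n-2), then join
    }

    public static void main(String[] args) {
        System.out.println(ForkJoinPool.commonPool().invoke(new FibTask(20))); // 6765
    }
}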
27. Producer Consumer
This pattern is often used, as it helps dynamically balance the workload
e.g., crawling the Web
Place new links in a queue so other threads can pick them up
Source: http://vichargrave.com/multithreaded-work-queue-in-c/
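A minimal Java sketch of the Web-crawling flavor of this pattern, using a bounded BlockingQueue; the URLs and counts are placeholders. The producer blocks when the queue is full and the consumer blocks when it is empty, which is what balances the workload dynamically:

import java.util.concurrent.*;

public class ProducerConsumerExample {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++)
                    queue.put("http://example.com/page" + i);     // blocks if full
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++)
                    System.out.println("crawling " + queue.take()); // blocks if empty
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start(); consumer.start();
    }
}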
28. Pipeline
Break a task into small steps (which may have dependencies) & assign execution of the steps to different threads
Example
Read file, sort file, & write to file
Work is handed off from step to step
An individual task doesn't gain, but if there are many instances of the task, we get better throughput
Gains come from tuning
Example – if read/write are slow but sort is fast, we can add more threads to read/write & fewer threads to sort
29. Pipeline (Cont.)
Long pipeline – high throughput
Short pipeline – low latency
Passing data from one stage to another
Message passing
Shared queues
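A minimal Java sketch of a three-stage pipeline using shared queues (one of the two hand-off options above); the stages and the -1 end-of-stream sentinel are arbitrary choices for the example:

import java.util.concurrent.*;

public class PipelineExample {
    public static void main(String[] args) {
        BlockingQueue<Integer> q1 = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> q2 = new LinkedBlockingQueue<>();

        new Thread(() -> {                           // stage 1: "read"
            try {
                for (int i = 1; i <= 5; i++) q1.put(i);
                q1.put(-1);                          // end-of-stream sentinel
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();

        new Thread(() -> {                           // stage 2: "transform"
            try {
                int v;
                while ((v = q1.take()) != -1) q2.put(v * v);
                q2.put(-1);                          // pass the sentinel along
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();

        new Thread(() -> {                           // stage 3: "write"
            try {
                int v;
                while ((v = q2.take()) != -1) System.out.println(v);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();
    }
}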
30. Asynchronous Agents
Here the task is done by a set of agents
Working in a P2P fashion
No clear structure
They talk to each other via asynchronous messages
Example – Detecting storms using weather data
Many agents, each knowing some aspects of storms
Weather events are sent to them, which in turn fire other events, leading to detection
Source: http://blogs.msdn.com/
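A minimal Java sketch of two agents exchanging asynchronous messages through per-agent mailboxes; the message names and the tiny protocol are entirely made up for illustration:

import java.util.concurrent.*;

public class AgentsExample {
    static class Agent implements Runnable {
        final String name;
        final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        BlockingQueue<String> peer;                  // the other agent's inbox

        Agent(String name) { this.name = name; }

        public void run() {
            try {
                while (true) {
                    String msg = inbox.take();       // block until a message arrives
                    System.out.println(name + " received: " + msg);
                    if (msg.startsWith("event"))     // react by firing a new event
                        peer.put("derived-" + msg);
                    else if (msg.startsWith("derived")) {
                        peer.put("stop");            // made-up protocol: end exchange
                        return;
                    } else return;                   // "stop" received
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Agent a = new Agent("A"), b = new Agent("B");
        a.peer = b.inbox; b.peer = a.inbox;          // P2P wiring, no central structure
        new Thread(a).start(); new Thread(b).start();
        a.inbox.put("event-1");                      // inject an initial event
    }
}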