Solution Patterns for Parallel Programming
1. Solution Patterns for Parallel Programming
CS4532 Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Dr. Srinath Perera
3. Building a Solution by Composition
We often solve problems by reducing them to a composition of known problems
Finding the way to Habarana?
Sorting 1 million integers
Can we solve this with Mutexes & Semaphores?
Mutex for mutual exclusion
Semaphores for signaling
There is another level above these primitives
4. Designing Parallel Algorithms
Parallel algorithm design is not easily reduced to simple recipes
A parallel version of a serial algorithm is not necessarily optimal
Good algorithms require creativity
Goal
Suggest a framework within which parallel algorithm design can be explored
Develop intuition as to what constitutes a good parallel algorithm
5. Methodical Design
Partitioning & communication focus on concurrency & scalability
Agglomeration & mapping focus on locality & other performance issues
Source: www.drdobbs.com/parallel/designing-parallel-algorithms-part-1/223100878
6. Methodical Design (Cont.)
1. Partitioning
Decompose computation/data into small tasks/chunks
Focus on recognizing opportunities for parallel execution
Practical issues such as the number of CPUs are ignored
2. Communication
Determine the communication required to coordinate task execution
Define communication structures & algorithms
7. Methodical Design (Cont.)
3. Agglomeration
Defined task & communication structures are evaluated with respect to
Performance requirements
Implementation costs
If necessary, tasks are combined into larger tasks to improve performance & reduce development costs
Source: www.drdobbs.com/architecture-and-design/designing-parallel-algorithms-part-3/223500075
8. Methodical Design (Cont.)
4. Mapping
Each task is assigned to a processor while attempting to satisfy the competing goals of
Maximizing processor utilization
Minimizing communication costs
Static mapping
At design/compile time
Dynamic mapping
At runtime by load-balancing algorithms
9. Parallel Algorithm Design Issues
Efficiency
Scalability
Partitioning computations
Domain decomposition – based on data
Functional decomposition – based on computation
Locality
Spatial & temporal
Synchronous & asynchronous communication
Agglomeration to reduce communication
Load-balancing
10. 3 Ways to Parallelize
1. By Data
Partition the data & give each partition to a different thread
2. By Task
Partition the task into smaller tasks & give them to different threads
3. By Order
Partition the task into steps & give them to different threads
11. By Data
Use the SPMD model
When data can be processed locally with few dependencies on other data
Patterns
Loop parallel, embarrassingly parallel
Large data units – underutilization
Small data units – thrashing
Chunk layout
Based on dependencies & caching
Example – Processing geographical data
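As a concrete illustration (not from the original slides), here is a minimal Java sketch of by-data parallelism: the array is split into chunks and each thread processes only its own chunk. The data and the doubling operation are arbitrary placeholders.

public class ByDataExample {
    public static void main(String[] args) throws InterruptedException {
        int[] data = new int[1_000_000];            // placeholder data
        int nThreads = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[nThreads];
        int chunk = (data.length + nThreads - 1) / nThreads;

        for (int t = 0; t < nThreads; t++) {
            final int start = t * chunk;
            final int end = Math.min(start + chunk, data.length);
            workers[t] = new Thread(() -> {
                // Each thread touches only its own chunk: no sharing, no locks
                for (int i = start; i < end; i++) {
                    data[i] = data[i] * 2;
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();          // wait for all chunks
    }
}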
12. By Task
Task Parallel, Divide & Conquer
Too many tasks – thrashing
Too few tasks – underutilization
Dependencies among tasks
Removable
Code transformations
Separable
Accumulation operations (average, sum, count) – see the sketch below
Extrema (max, min)
Read only, Read/Write
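A minimal Java sketch of a separable accumulation (the sum case from the list above); the task count and data are arbitrary choices. Each task accumulates into its own private partial sum, and the partial results are combined only at the end:

import java.util.*;
import java.util.concurrent.*;

public class ByTaskSum {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);                        // placeholder data
        int nTasks = 4;
        int chunk = data.length / nTasks;

        ExecutorService pool = Executors.newFixedThreadPool(nTasks);
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int t = 0; t < nTasks; t++) {
            final int start = t * chunk;
            final int end = (t == nTasks - 1) ? data.length : start + chunk;
            // Each task sums its own range: the dependency is separable
            tasks.add(() -> {
                long s = 0;
                for (int i = start; i < end; i++) s += data[i];
                return s;
            });
        }
        long total = 0;
        for (Future<Long> f : pool.invokeAll(tasks)) total += f.get();
        pool.shutdown();
        System.out.println(total);                   // 1000000
    }
}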
13. By Order
Pipeline & Asynchronous Agents
Dependencies
Temporal – before/after
Same time
None
14. Load Balancing
Some threads will be busy while others are idle
Counter this by distributing the load equally
When the cost of the problem is well understood this is possible
e.g., matrix multiplication, known tree walk
Some other problems are not that simple
When it's hard to predict how the workload will be distributed, use dynamic load balancing
But this requires communication between threads/tasks
2 methods for dynamic load balancing
Task queues
Work stealing
15. Task Queues
A task queue shared by multiple threads (producer-consumer)
Threads come to the task queue after finishing a task & grab the next one
Typically run with a thread pool with a fixed number of threads
Source: http://blog.zenika.com
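A minimal Java sketch of this pattern: Executors.newFixedThreadPool creates a fixed number of threads that pull tasks from a shared internal queue, which is one common realization of a task queue (the task bodies here are placeholders):

import java.util.concurrent.*;

public class TaskQueueExample {
    public static void main(String[] args) throws InterruptedException {
        // 4 threads pull from the pool's shared task queue
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 20; i++) {
            final int id = i;
            pool.submit(() -> System.out.println(
                    Thread.currentThread().getName() + " ran task " + id));
        }
        pool.shutdown();                             // no new tasks accepted
        pool.awaitTermination(1, TimeUnit.MINUTES);  // wait for queued tasks
    }
}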
16. Work Stealing
Every thread has its own work/task queue
When a thread runs out of work, it goes to another thread's queue & "steals" work
Source: http://karlsenchoi.blogspot.com
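In Java, ForkJoinPool (also available via Executors.newWorkStealingPool) implements this pattern: each worker thread owns a deque of tasks, and idle workers steal from the deques of busy ones. A minimal sketch with placeholder tasks:

import java.util.concurrent.*;

public class WorkStealingExample {
    public static void main(String[] args) throws InterruptedException {
        // Backed by a ForkJoinPool: each worker has its own deque,
        // and idle workers steal tasks from other workers' deques
        ExecutorService pool = Executors.newWorkStealingPool();
        for (int i = 0; i < 20; i++) {
            final int id = i;
            pool.submit(() -> System.out.println(
                    Thread.currentThread().getName() + " ran task " + id));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}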
17. Efficiency = Maximizing Parallelism?
Usually it means 2 things
Run the algorithm on the maximum number of threads with minimal communication/waiting
When the size of the problem grows, the algorithm can handle it by adding new resources
It's done by the right architecture + tuning
There is no clear recipe for doing it
Just like "design patterns" for OOP, people have identified parallel programming patterns
18. Solution Patterns for Parallelism
Loop Parallel
Fork/Join
Divide & Conquer
Producer Consumer
Pipeline
Asynchronous Agents
19. Loop Parallel
If each iteration in a loop depends only on that iteration's results + read-only data, each iteration can run in a different thread
As it's based on data, this is also called data parallelism
int[] A = ..; int[] B = ..; int[] C = ..;
for (int i = 0; i < N; i++) {
    C[i] = F(A[i], B[i]);
}
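One way to realize this in Java (an illustrative sketch, with an arbitrary stand-in for F) is a parallel stream over the index range; since the iterations are independent, the runtime may split the range across threads:

import java.util.stream.IntStream;

public class LoopParallelExample {
    static int F(int a, int b) { return a + b; }     // arbitrary stand-in for F

    public static void main(String[] args) {
        int N = 1_000_000;
        int[] A = new int[N], B = new int[N], C = new int[N];
        // Each index is independent, so the range can be split across threads
        IntStream.range(0, N).parallel().forEach(i -> C[i] = F(A[i], B[i]));
    }
}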
20. Which of These are Loop Parallel?
// B is read-only here, so reading B[i-1] creates no loop-carried
// dependency: this loop IS loop parallel.
int[] A = ..; int[] B = ..; int[] C = ..;
for (int i = 1; i < N; i++) {
    C[i] = F(A[i], B[i-1]);
}
// C[i] depends on C[i-1] written by the previous iteration (a
// loop-carried dependency): this loop is NOT loop parallel.
int[] A = ..; int[] B = ..; int[] C = ..;
for (int i = 1; i < N; i++) {
    C[i] = F(A[i], C[i-1]);
}
22. Fork/Join
Fork a job into smaller tasks (independent if possible), perform them, & join them
Examples
Calculate the mean across an array
Tree walk
How to partition?
By Data, e.g., SPMD
By Task, e.g., MPSD
Source: http://en.wikipedia.org/wiki/Fork%E2%80%93join_model
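A minimal Java sketch of the "mean across an array" example using the fork/join framework; the threshold value is an arbitrary choice. Large ranges are forked into halves, small ranges are summed directly, and the partial sums are joined:

import java.util.Arrays;
import java.util.concurrent.*;

public class MeanTask extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000;             // arbitrary cut-off
    final int[] data; final int lo, hi;

    MeanTask(int[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override protected Long compute() {
        if (hi - lo <= THRESHOLD) {                  // small enough: sum directly
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) / 2;                     // fork into two halves
        MeanTask left = new MeanTask(data, lo, mid);
        left.fork();                                 // left half runs asynchronously
        long right = new MeanTask(data, mid, hi).compute();
        return right + left.join();                  // join the forked half
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 3);
        long sum = ForkJoinPool.commonPool().invoke(new MeanTask(data, 0, data.length));
        System.out.println("mean = " + (double) sum / data.length); // 3.0
    }
}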
23. Fork/Join (Cont.)
Size of work unit
Small units – thrashing
Big units – imbalance
Balancing load among threads
Static allocation
If data/task is completely known
e.g., matrix addition
Dynamic allocation (e.g., tree walks)
Task queues
Work stealing
25. Divide & Conquer
Break the problem into recursive sub-problems & assign them to different threads
Examples
Quick sort
Search for a value in a tree
Calculating the Fibonacci sequence
Often forks again, leading to an execution tree
Recursion
May or may not have a join step
Deep tree – thrashing
Shallow tree – underutilization
26. Divide & Conquer – Fibonacci Sequence
Source: Introduction to Algorithms (3rd Edition) by Cormen, Leiserson, Rivest and Stein
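A sketch of the recursive Fibonacci computation from Cormen et al., expressed here with Java's fork/join framework (algorithmically inefficient, but it shows how each call forks again into an execution tree):

import java.util.concurrent.*;

public class FibTask extends RecursiveTask<Long> {
    final int n;
    FibTask(int n) { this.n = n; }

    @Override protected Long compute() {
        if (n < 2) return (long) n;                  // base case: leaf of the tree
        FibTask f1 = new FibTask(n - 1);
        f1.fork();                                   // spawn FIB(n-1)
        FibTask f2 = new FibTask(n - 2);
        return f2.compute() + f1.join();             // compute FIB(n-2), then join
    }

    public static void main(String[] args) {
        System.out.println(ForkJoinPool.commonPool().invoke(new FibTask(20))); // 6765
    }
}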
27. Producer Consumer
This pattern is often used, as it helps dynamically balance the workload
e.g., crawling the Web
Place new links in a queue so other threads can pick them up
Source: http://vichargrave.com/multithreaded-work-queue-in-c/
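A minimal Java sketch of the Web-crawling flavor of this pattern, using a bounded BlockingQueue; the URLs and counts are placeholders. The producer blocks when the queue is full and the consumer blocks when it is empty, which is what balances the workload dynamically:

import java.util.concurrent.*;

public class ProducerConsumerExample {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++)
                    queue.put("http://example.com/page" + i);     // blocks if full
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++)
                    System.out.println("crawling " + queue.take()); // blocks if empty
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start(); consumer.start();
    }
}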
28. Pipeline
Break a task into small steps (which may have dependencies) & assign execution of the steps to different threads
Example
Read file, sort file, & write to file
Work is handed off from step to step
An individual task doesn't gain, but if there are many instances of the task, we get better throughput
Gains come from tuning
Example – if read/write are slow but sort is fast, we can add more threads to read/write & fewer threads to sort
29. Pipeline (Cont.)
Long pipeline – high throughput
Short pipeline – low latency
Passing data from one stage to another
Message passing
Shared queues
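A minimal Java sketch of a three-stage pipeline using shared queues (one of the two hand-off options above); the stages and the -1 end-of-stream sentinel are arbitrary choices for the example:

import java.util.concurrent.*;

public class PipelineExample {
    public static void main(String[] args) {
        BlockingQueue<Integer> q1 = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> q2 = new LinkedBlockingQueue<>();

        new Thread(() -> {                           // stage 1: "read"
            try {
                for (int i = 1; i <= 5; i++) q1.put(i);
                q1.put(-1);                          // end-of-stream sentinel
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();

        new Thread(() -> {                           // stage 2: "transform"
            try {
                int v;
                while ((v = q1.take()) != -1) q2.put(v * v);
                q2.put(-1);                          // pass the sentinel along
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();

        new Thread(() -> {                           // stage 3: "write"
            try {
                int v;
                while ((v = q2.take()) != -1) System.out.println(v);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();
    }
}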
30. Asynchronous Agents
Here the task is done by a set of agents
Working in a P2P fashion
No clear structure
They talk to each other via asynchronous messages
Example – Detecting storms using weather data
Many agents, each knowing some aspects of storms
Weather events are sent to them, which in turn fire other events, leading to detection
Source: http://blogs.msdn.com/
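A minimal Java sketch of two agents exchanging asynchronous messages through per-agent mailboxes; the message names and the tiny protocol are entirely made up for illustration:

import java.util.concurrent.*;

public class AgentsExample {
    static class Agent implements Runnable {
        final String name;
        final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        BlockingQueue<String> peer;                  // the other agent's inbox

        Agent(String name) { this.name = name; }

        public void run() {
            try {
                while (true) {
                    String msg = inbox.take();       // block until a message arrives
                    System.out.println(name + " received: " + msg);
                    if (msg.startsWith("event"))     // react by firing a new event
                        peer.put("derived-" + msg);
                    else if (msg.startsWith("derived")) {
                        peer.put("stop");            // made-up protocol: end exchange
                        return;
                    } else return;                   // "stop" received
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Agent a = new Agent("A"), b = new Agent("B");
        a.peer = b.inbox; b.peer = a.inbox;          // P2P wiring, no central structure
        new Thread(a).start(); new Thread(b).start();
        a.inbox.put("event-1");                      // inject an initial event
    }
}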