Solution Patterns for Parallel
Programming
CS4532 Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Dr. Srinath Perera
Outline
 Designing parallel algorithms
 Solution patterns for parallelism
 Loop Parallel
 Fork/Join
 Divide & Conquer
 Pipeline
 Asynchronous Agents
 Producer/Consumer
 Load balancing
2
Building a Solution by Composition
 We often solve problems by reducing the problem
to a composition of known problems
 Finding the way to Habarana?
 Sorting 1 million integers
 Can we solve this with Mutex & Semaphores?
 Mutex for mutual exclusion
 Semaphores for signaling
 There is another level
3
Designing Parallel Algorithms
 Parallel algorithm design is not easily reduced to
simple recipes
 A parallel version of a serial algorithm is not
necessarily optimal
 Good algorithms require creativity
 Goal
 Suggest a framework within which parallel algorithm
design can be explored
 Develop intuition as to what constitutes a good
parallel algorithm
4
Methodical Design
 Partitioning &
communication focus
on concurrency &
scalability
 Agglomeration &
mapping focus on
locality & other
performance issues
5
Source: www.drdobbs.com/parallel/designing-parallel-algorithms-part-1/223100878
Methodical Design (Cont.)
1. Partitioning
 Decompose computation/data into small tasks/chunks
 Focus on recognizing opportunities for parallel
execution
 Practical issues such as the number of CPUs are ignored
2. Communication
 Determine communication required to coordinate task
execution
 Define communication structures & algorithms
6
Methodical Design (Cont.)
3. Agglomeration
 Defined task & communication structures are
evaluated with respect to
 Performance requirements
 Implementation costs
 If necessary, tasks are combined into larger tasks to
improve
 Performance
 Reduce development costs
7
Source: www.drdobbs.com/architecture-and-design/designing-parallel-algorithms-part-3/223500075
Methodical Design (Cont.)
4. Mapping
 Each task is assigned to a processor while attempting
to satisfy competing goals of
 Maximizing processor utilization
 Minimizing communication costs
 Static mapping
 At design/compile time
 Dynamic mapping
 At runtime by load-balancing algorithms
8
Parallel Algorithm Design Issues
 Efficiency
 Scalability
 Partitioning computations
 Domain decomposition – based on data
 Functional decomposition – based on computation
 Locality
 Spatial & temporal
 Synchronous & asynchronous communication
 Agglomeration to reduce communication
 Load-balancing
9
3 Ways to Parallelize
1. By Data
 Partition data & give it to different threads
2. By Task
 Partition task into smaller tasks & give it to different
threads
3. By Order
 Partition task into steps & give them to different threads
10
By Data
 Use SPMD model
 When data can be processed locally, with few
dependencies on other data
 Patterns
 Loop parallel, embarrassingly parallel
 Large data units – underutilization
 Small data units – thrashing
 Chunk layout
 Based on dependencies & caching
 Example – Processing geographical data
11
By Task
 Task Parallel, Divide & Conquer
 Too many tasks – thrashing
 Too few tasks – underutilization
 Dependencies among tasks
 Removable
 Code transformations
 Separable
 Accumulation operations (average, sum, count)
 Extrema (max, min)
 Read only, Read/Write
12
By Order
 Pipeline & Asynchronous Agents
 Dependencies
 Temporal – before/after
 Same time
 None
13
Load Balancing
 Some threads will be busy while others are idle
 Counter this by distributing the load equally
 Possible when the cost of the problem is well understood
 e.g., matrix multiplication, known tree walk
 Some other problems are not that simple
 Hard to predict how the workload will be distributed → use
dynamic load balancing
 But this requires communication between threads/tasks
 2 methods for dynamic load balancing
 Task queues
 Work stealing
14
Task Queues
 Multiple instances of task queues (producer/
consumer)
 Threads come to the task queue after finishing a
task & grab the next task
 Typically run with a thread pool with a fixed number
of threads
15
Source: http://blog.zenika.com
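The deck implements task queues with Pthreads/OpenMP; as an illustrative, language-neutral sketch, the same idea in Python's stdlib `queue` and `threading` (the name `run_task_queue` and the worker-count default are mine, not from the slides):

```python
import queue
import threading

def run_task_queue(tasks, num_workers=4):
    """Dynamic load balancing: each worker repeatedly grabs the next
    task from a shared queue, so fast workers naturally do more work."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()   # grab the next task
            except queue.Empty:
                return                  # queue drained: worker exits
            r = task()                  # run it
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_task_queue([(lambda i=i: i * i) for i in range(10)])` returns the ten squares, in whatever order the workers finished them.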
Work Stealing
 Every thread has its own work/task queue
 When one thread runs out of work, it goes to another
thread’s queue & “steals” work
16
Source: http://karlsenchoi.blogspot.com
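A minimal sketch of work stealing, again in Python for illustration (real schedulers use lock-free deques; here `deque` operations are atomic enough under the GIL). Each worker pops from its own end and steals from the opposite end of a victim's deque:

```python
import collections
import threading

def work_steal(tasks, num_workers=4):
    """Each worker owns a deque of tasks: it pops work from its own end,
    and when that is empty it 'steals' from the far end of another's."""
    queues = [collections.deque() for _ in range(num_workers)]
    for i, t in enumerate(tasks):
        queues[i % num_workers].append(t)      # initial distribution
    results, lock = [], threading.Lock()

    def worker(k):
        while True:
            try:
                task = queues[k].pop()         # own queue, LIFO end
            except IndexError:
                task = None
                for victim in range(num_workers):
                    if victim == k:
                        continue
                    try:
                        task = queues[victim].popleft()  # steal, FIFO end
                        break
                    except IndexError:
                        pass
                if task is None:
                    return                     # nothing left anywhere
            with lock:
                results.append(task())

    threads = [threading.Thread(target=worker, args=(k,))
               for k in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Stealing from the opposite end of the victim's deque reduces contention with the victim, which keeps working on its own end.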
Efficiency = Maximizing Parallelism?
 Usually it is 2 things
 Run the algorithm on the MAX number of threads with
minimal communication/waiting
 When the size of the problem grows, the algorithm can
handle it by adding new resources
 It’s done by the right architecture + tuning
 There is no clear way to do it
 Just like “design patterns” for OOP, people have
identified parallel programming patterns
17
Solution Patterns for Parallelism
 Loop Parallel
 Fork/Join
 Divide and Conquer
 Producer/Consumer
 Pipeline
 Asynchronous Agents
18
Loop Parallel
 If each iteration in a loop depends only on that
iteration’s results + read-only data, each iteration
can run in a different thread
 As it’s based on data, also called data parallelism
int[] A = .. int[] B = .. int[] C = ..
for (int i = 0; i < N; i++) {
C[i] = F(A[i], B[i]);
}
19
Which of These are Loop Parallel?
int[] A = .. int[] B = .. int[] C = ..
for (int i = 1; i < N; i++) {
C[i] = F(A[i], B[i-1]);
}
int[] A = .. int[] B = .. int[] C = ..
for (int i = 1; i < N; i++) {
C[i] = F(A[i], C[i-1]);
}
20
Implementing Loop Parallel
 OpenMP example
21
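The deck's OpenMP example is not reproduced on this slide; in C it would amount to putting `#pragma omp parallel for` on the loop from slide 19. As a hedged sketch of the same idea in Python, using a thread pool (`F` here is a stand-in for the slide's `F`):

```python
from concurrent.futures import ThreadPoolExecutor

def F(a, b):
    # stand-in for the slide's per-element function F
    return a + b

def loop_parallel(A, B, workers=4):
    # Each iteration reads only A[i] and B[i] and writes only C[i],
    # so iterations can run in any order on any thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(F, A, B))
```

`pool.map` preserves element order in the result even though iterations may complete out of order.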
Fork/Join
 Fork a job into smaller tasks (independent if
possible), perform them, & join the results
 Examples
 Calculate the mean across an array
 Tree walk
 How to partition?
 By Data, e.g., SPMD
 By Task, e.g., MPSD
22
Source: http://en.wikipedia.org/wiki/Fork%E2%80%93join_model
Fork/Join (Cont.)
 Size of work unit
 Small units – thrashing
 Big units – imbalance
 Balancing load among threads
 Static allocation
 If data/task is completely known
 E.g., matrix addition
 Dynamic allocation (tree walks)
 Task queues
 Work Stealing
23
Implementing Fork/Join
 Pthreads
 OpenMP
24
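The deck points at Pthreads/OpenMP for fork/join. As an illustrative sketch (the name `parallel_mean` and the thread-count default are mine), here is slide 22's example, the mean across an array, forked by data into chunks:

```python
import threading

def parallel_mean(data, num_threads=4):
    """Fork: one thread sums each chunk. Join: combine partial sums."""
    chunk = (len(data) + num_threads - 1) // num_threads  # ceil division
    partial = [0.0] * num_threads

    def worker(k):
        # each thread touches only its own slice and its own slot
        partial[k] = sum(data[k * chunk:(k + 1) * chunk])

    threads = [threading.Thread(target=worker, args=(k,))
               for k in range(num_threads)]
    for t in threads:
        t.start()   # fork
    for t in threads:
        t.join()    # join
    return sum(partial) / len(data)
```

This is a static allocation: the chunk boundaries are fixed up front, which is fine because the cost per element is uniform.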
Divide & Conquer
 Break problem into recursive sub-problems &
assign them to different threads
 Examples
 Quick sort
 Search for a value in a tree
 Calculating the Fibonacci sequence
 Sub-problems often fork again, leading to an execution tree
 Recursion
 May or may not have a join step
 Deep tree – thrashing
 Shallow tree – underutilization
25
Divide & Conquer – Fibonacci
Sequence
Source - Introduction to Algorithms (3rd Edition) by Cormen, Leiserson, Rivest and Stein
26
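A sketch of the divide & conquer execution tree for Fibonacci, in Python rather than the Cormen et al. pseudocode the slide is drawn from; the `depth` cutoff is my addition, illustrating the deep-tree vs. shallow-tree trade-off from the previous slide:

```python
import threading

def fib(n, depth=2):
    """Divide fib(n) into fib(n-1) & fib(n-2); run one branch in a new
    thread.  'depth' caps the execution tree: below it we go serial,
    since spawning a thread per tiny sub-problem would thrash."""
    if n < 2:
        return n
    if depth == 0:
        return fib(n - 1, 0) + fib(n - 2, 0)   # serial below the cutoff
    result = {}

    def branch():
        result["left"] = fib(n - 1, depth - 1)

    t = threading.Thread(target=branch)
    t.start()                                  # fork one sub-problem
    right = fib(n - 2, depth - 1)              # solve the other here
    t.join()                                   # join step
    return result["left"] + right
```

Note this pattern both has a join step and forks recursively, matching the execution-tree picture on the previous slide.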
Producer Consumer
 This pattern is often used, as it helps
dynamically balance the workload
 E.g., crawling the Web
 Place new links in a queue so other threads can pick them up
27
Source: http://vichargrave.com/multithreaded-work-queue-in-c/
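The Web-crawl example can be sketched with a shared queue where every worker is both consumer (takes a page) and producer (enqueues the links it discovers). This is an illustrative Python sketch; `children`, mapping a page to the links found on it, stands in for real fetching and link extraction:

```python
import queue
import threading

def crawl(start, children, num_workers=3):
    """Producer/consumer crawl: workers consume pages from a shared
    queue and produce (enqueue) newly discovered links for others."""
    q = queue.Queue()
    q.put(start)
    seen = {start}
    lock = threading.Lock()

    def worker():
        while (page := q.get()) is not None:   # None = shut down
            # 'fetch' the page and produce its links for other workers
            for link in children.get(page, []):
                with lock:
                    if link in seen:
                        continue
                    seen.add(link)
                q.put(link)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    q.join()                                   # every queued page handled
    for _ in threads:
        q.put(None)                            # stop the workers
    for t in threads:
        t.join()
    return seen
```

`q.join()` returns once every enqueued page has been processed, which is why `task_done()` is only called after a page's links have been enqueued.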
Pipeline
 Break a task into small steps (which may have
dependencies) & assign execution of steps to
different threads
 Example
 Read file, sort file, & write to file
 Work is handed off from step to step
 Each individual task doesn’t gain, but if there are many
instances of the task, we get better throughput
 Gains come from tuning
 Example – if read/write are slow but sort is fast, we can
add more threads to read/write & fewer threads to sort
28
Pipeline (Cont.)
 Long pipeline – high throughput
 Short pipeline – low latency
 Passing data from one stage to another
 Message passing
 Shared queues
29
Asynchronous Agents
 Here the task is done by a set of agents
 Working in a P2P fashion
 No clear structure
 They talk to each other via asynchronous messages
 Example – Detecting storms using weather data
 Many agents, each knowing some aspects of storms
 Weather events are sent to them, which in turn fire
other events, leading to detection
30
Source: http://blogs.msdn.com/
31
Editor's Notes
  1. Shovel example
  2. Along A6 after Dambulla