1) Parallel algorithms divide a problem into sub-problems that can be solved concurrently. This document discusses techniques for decomposing problems into parallel tasks, including recursive decomposition, data decomposition, and partitioning input/output/intermediate data.
2) Data decomposition involves partitioning the data used in computations and assigning each partition to a separate task. This is a common technique that can induce concurrency in algorithms that operate on large data structures.
3) Partitioning can involve dividing input data, output data, or intermediate data among tasks. The goal is to identify independent work that can be done simultaneously while minimizing communication between tasks.
3. What is an Algorithm?
• An algorithm is a sequence of steps that takes inputs from the user and, after some computation, produces an output.
• As per the architecture, there are two types of computers:
– Sequential Computer
– Parallel Computer
4. Cont..
• Depending on the architecture of computers, we have two types of algorithms:
• Sequential Algorithm − An algorithm in which consecutive steps of instructions are executed in chronological order to solve a problem.
• Parallel Algorithm − The problem is divided into sub-problems that are executed in parallel to get individual outputs. Later on, these individual outputs are combined to get the final desired output.
OR
• A parallel algorithm is an algorithm that can execute several instructions simultaneously on different processing devices and then combine all the individual outputs to produce the final result.
5. Constructing a Parallel Algorithm
• Identify portions of work that can be performed concurrently.
• Map concurrent portions of work onto multiple processes running in parallel.
• Distribute a program's input, output, and intermediate data.
• Manage accesses to shared data: avoid conflicts.
• Synchronize the processes at stages of the parallel program execution.
7. Communications
• Who needs communications? The need for communications between tasks depends upon your problem.
• YOU DON'T NEED COMMUNICATIONS
– Some types of problems can be decomposed and executed in parallel with virtually no need for tasks to share data. These types of problems are often called embarrassingly parallel - little or no communication is required.
• Need for communication: Most parallel applications are not quite so simple, and do require tasks to share data with each other.
• For example, a 2-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data. Changes to neighboring data have a direct effect on that task's data.
8. COMMUNICATION OVERHEAD
• Inter-task communication virtually always implies overhead.
• Machine cycles and resources that could be used for computation are instead used to package and transmit data.
• Communications frequently require some type of synchronization between tasks, which can result in tasks spending time "waiting" instead of doing work.
• Competing communication traffic can saturate the available network bandwidth, further aggravating performance problems.
9. SYNCHRONOUS (COORDINATED) VS. ASYNCHRONOUS COMMUNICATIONS
• Synchronous communications require some type of "handshaking" between tasks that are sharing data. This can be explicitly structured in code by the programmer, or it may happen at a lower level unknown to the programmer.
• Synchronous communications are often referred to as blocking communications, since other work must wait until the communications have completed.
• Asynchronous communications allow tasks to transfer data independently from one another. For example, task 1 can prepare and send a message to task 2, and then immediately begin doing other work. When task 2 actually receives the data doesn't matter.
– Asynchronous communications are often referred to as non-blocking communications, since other work can be done while the communications are taking place (a short illustrative sketch follows).
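A minimal sketch (not from the original slides) of the blocking vs. non-blocking distinction, using Python's concurrent.futures; the worker function and timings are purely illustrative:

```python
# Minimal sketch (not from the slides) of blocking vs. non-blocking styles.
from concurrent.futures import ThreadPoolExecutor
import time

def produce_data(x):
    time.sleep(0.1)                  # simulated work / transfer latency
    return x * x

with ThreadPoolExecutor(max_workers=2) as pool:
    # Synchronous (blocking) style: wait for the data before continuing.
    result = pool.submit(produce_data, 3).result()   # blocks until done
    print("blocking result:", result)

    # Asynchronous (non-blocking) style: start the transfer, keep working,
    # and collect the result later.
    future = pool.submit(produce_data, 4)            # returns immediately
    other_work = sum(range(1000))                    # overlapped computation
    print("other work:", other_work, "async result:", future.result())
```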
17. Degree of Concurrency
• Degree of concurrency: the number of tasks that can execute in parallel.
– Maximum degree of concurrency: the largest number of concurrent tasks at any point of the execution.
– Average degree of concurrency: the average number of tasks that can be executed concurrently.
– Degree of concurrency vs. task granularity
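An illustrative sketch (with a small hypothetical, unit-cost task graph, not from the slides): tasks are grouped into levels of the task dependency graph; the largest level approximates the maximum degree of concurrency, and the total work divided by the critical-path length gives the average degree of concurrency.

```python
from collections import defaultdict

# Hypothetical dependency graph: edge (u, v) means task v depends on task u.
edges = [("t1", "t5"), ("t2", "t5"), ("t3", "t6"), ("t4", "t6"),
         ("t5", "t7"), ("t6", "t7")]
tasks = {"t1", "t2", "t3", "t4", "t5", "t6", "t7"}

preds = defaultdict(set)
for u, v in edges:
    preds[v].add(u)

level = {}
def task_level(t):
    # A task's level is 1 + the deepest level among its predecessors.
    if t not in level:
        level[t] = 1 + max((task_level(p) for p in preds[t]), default=0)
    return level[t]

by_level = defaultdict(list)
for t in tasks:
    by_level[task_level(t)].append(t)

critical_path_length = max(by_level)                   # longest chain of tasks
max_degree = max(len(ts) for ts in by_level.values())  # widest level
avg_degree = len(tasks) / critical_path_length         # total work / critical path
print(max_degree, round(avg_degree, 2))                # 4 and 2.33 for this graph
```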
22.
• Speedup = sequential execution time / parallel execution time
• Parallel efficiency = sequential execution time / (parallel execution time × processors used)
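• Illustrative worked example (numbers chosen here, not from the slides): a program that takes 100 s sequentially and 20 s on 8 processors has speedup = 100 / 20 = 5 and parallel efficiency = 100 / (20 × 8) = 0.625, i.e. the processors do useful work about 62.5% of the time.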
24. Task Interaction Graph
• The pattern of interaction among tasks is captured by what is known as the task interaction graph.
• Node = task
• Edges connect tasks that interact with each other
38. 3.1.3 Processes and Mapping (1/5)
■ In general, the number of tasks in a decomposition exceeds the number of processing elements available.
■ For this reason, a parallel algorithm must also provide a mapping of tasks to processes.
39. Processes and Mapping (2/5)
Note: We refer to the mapping as being from tasks to processes, as opposed to processors. This is because typical programming APIs, as we shall see, do not allow easy binding of tasks to physical processors. Rather, we aggregate tasks into processes and rely on the system to map these processes to physical processors (CPUs).
We use processes, not in the UNIX sense of a process, but simply as a collection of tasks and associated data.
40. Processes and Mapping (3/5)
■ Appropriate mapping of tasks to processes is critical to the parallel performance of an algorithm.
■ Mappings are determined by both the task dependency and task interaction graphs.
■ Task dependency graphs can be used to ensure that work is equally spread across all processes at any point (minimum idling and optimal load balance).
■ Task interaction graphs can be used to make sure that processes need minimum interaction with other processes (minimum communication).
41. Processes and Mapping (4/5)
An appropriate mapping must minimize parallel execution time by:
■ Mapping independent tasks to different processes.
■ Assigning tasks on the critical path to processes as soon as they become available.
■ Minimizing interaction between processes by mapping tasks with dense interactions to the same process.
42. Processes and Mapping (5/5)
■ Difficulty: these criteria often conflict with one another.
■ E.g., a decomposition into a single task (no decomposition) minimizes interactions but offers no speedup!
43. Processes and Mapping: Example (1/2)
■ Mapping tasks in the database query decomposition to processes.
■ These mappings were arrived at by viewing the dependency graph in terms of levels (no two nodes in a level have dependencies).
■ Tasks within a single level are then assigned to different processes.
45. 3.1.4 Processes vs. Processors (1/2)
■ Processors are physical resources.
■ Processes provide a convenient way of abstracting or virtualizing a multiprocessor.
■ Write parallel programs in terms of processes, not physical processors; mapping processes to processors is a subsequent step.
■ The number of processes does not have to be the same as the number of processors available to the program.
■ If there are more processes, they are multiplexed onto the available processors; if there are fewer processes, then some processors will remain idle.
46. Processes vs. Processors (2/2)
■ The correct OS definition of a process is that of an address space and one or more threads of control that share that address space.
■ Thus, processes and threads are distinguished in that definition.
■ In what follows, we assume that a process has only one thread of control.
47. Decomposition Techniques for Parallel Algorithms
• So how does one decompose a task into various subtasks?
• While there is no single recipe that works for all problems, we present a set of commonly used techniques that apply to broad classes of problems. These include:
1. recursive decomposition
2. data decomposition
3. exploratory decomposition
4. speculative decomposition
48. 1. Recursive Decomposition
• Generally suited to problems (tasks) that are solved using the divide-and-conquer strategy.
• A given problem is first decomposed into a set of sub-problems.
• These sub-problems are recursively decomposed further until a desired granularity is reached.
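A minimal sketch of recursive decomposition (illustrative, not the slides' code), applied to finding the minimum of a list: each level of the divide-and-conquer recursion spawns a task for one half of the data, recurses on the other half itself, and combines the two partial results. Threads are used only to show the task structure; real speedup in Python would need processes or native code.

```python
import threading

CUTOFF = 4  # below this size, solve the sub-problem sequentially

def parallel_min(data):
    if len(data) <= CUTOFF:
        return min(data)                       # desired granularity reached
    mid = len(data) // 2
    result = {}
    left_task = threading.Thread(
        target=lambda: result.update(left=parallel_min(data[:mid])))
    left_task.start()                          # left sub-problem runs concurrently
    right = parallel_min(data[mid:])           # right sub-problem runs here
    left_task.join()                           # wait for the concurrent sub-problem
    return min(result["left"], right)          # combine the partial results

print(parallel_min([9, 4, 7, 1, 8, 3, 6, 2]))  # -> 1
```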
54. 3.2.2 Data Decomposition (1/3)
■ Identify the data on which computations are performed.
■ Partition this data across various tasks.
■ This partitioning induces a decomposition of the problem.
■ Data can be partitioned in various ways - this critically impacts performance of a parallel algorithm.
55. 3.2.2 Data Decomposition (2/3)
■ A powerful and commonly used method for deriving concurrency in algorithms that operate on large data structures.
■ The decomposition of computations is done in two steps:
1. The data on which the computations are performed is partitioned,
2. This data partitioning is used to induce a partitioning of the computations into tasks.
56. Data Decomposition (3/3)
■ The operations that these tasks perform on different data partitions are usually similar (e.g., the matrix multiplication that follows) or are chosen from a small set of operations (e.g., LU factorization).
■ One must explore and evaluate all possible ways of partitioning the data and determine which one yields a natural and efficient computational decomposition.
57. 3.2.2.1 Partitioning Output Data (1/10)
■ In many computations, each element of the output can be computed independently of others as a function of the input.
■ In such computations, a partitioning of the output data automatically induces a decomposition of the problem into tasks.
■ Each task is assigned the work of computing a portion of the output.
58. Partitioning Output Data: Example-1
Consider the problem of multiplying two n x n matrices A and B to yield matrix C. The output matrix C can be partitioned into four tasks as follows:
59. Partitioning Output Data: Example-1 (2/3)
■ The 4 sub-matrices of C, roughly of size (n/2 x n/2) each, are then independently computed by 4 tasks as the sums of the appropriate products of sub-matrices of A and B.
■ Other task decompositions are possible.
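A sketch of this 2 x 2 output partitioning (assuming numpy is available; illustrative code, not from the slides): each of the four tasks owns one (n/2 x n/2) block of C and computes it independently from the corresponding blocks of A and B.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
half = {0: slice(0, n // 2), 1: slice(n // 2, n)}   # row/column halves

def compute_block(i, j):
    # Task (i, j) computes C_ij = A_i0 * B_0j + A_i1 * B_1j.
    return (A[half[i], half[0]] @ B[half[0], half[j]] +
            A[half[i], half[1]] @ B[half[1], half[j]])

C = np.empty((n, n))
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {(i, j): pool.submit(compute_block, i, j)
               for i in (0, 1) for j in (0, 1)}
for (i, j), fut in futures.items():
    C[half[i], half[j]] = fut.result()

assert np.allclose(C, A @ B)   # same result as the direct product
```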
61. Partitioning Output Data: Example-2 (1/6)
■ Computing frequencies of itemsets in a transaction database.
■ Given a set T containing n transactions and a set I containing m itemsets.
■ Each transaction and itemset contains a small number of items, out of a possible set of items.
62. Partitioning Output Data: Example-2 (2/6)
■ Example: T is a grocery store's database of customer sales, with each transaction being an individual grocery list of a shopper, and an itemset could be a group of items in the store.
■ If the store desires to find out how many customers bought each of the designated groups of items, then it would need to find the number of times that each itemset in I appears in all the transactions (the number of transactions of which each itemset is a subset).
65. Partitioning Output Data: Example-2 (5/6)
■ Fig. (a) shows an example.
■ The database shown consists of 10 transactions, and we are interested in computing the frequency of the 8 itemsets shown in the second column.
■ The actual frequencies of these itemsets in the database (output) are shown in the third column.
■ For instance, itemset {D, K} appears twice: once in the second and once in the ninth transaction.
66. Partitioning Output Data: Example-2 (6/6)
■ Fig. (b) shows how the computation of frequencies of the itemsets can be decomposed into 2 tasks
■ by partitioning the output into 2 parts and having each task compute its half of the frequencies.
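An illustrative sketch of this decomposition with hypothetical data: the output (the itemset frequencies) is split between two tasks; each task scans all the transactions but counts only the itemsets it owns.

```python
from concurrent.futures import ThreadPoolExecutor

transactions = [{"A", "B", "C"}, {"D", "K"}, {"A", "E"},
                {"C", "D"}, {"D", "K", "A"}]
itemsets = [{"A"}, {"D", "K"}, {"A", "B"}, {"C", "D"}]

def count(my_itemsets):
    # Frequency = number of transactions of which the itemset is a subset.
    return {frozenset(s): sum(s <= t for t in transactions)
            for s in my_itemsets}

half = len(itemsets) // 2
with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(count, [itemsets[:half], itemsets[half:]]))

frequencies = {}
for part in parts:        # the two output halves are disjoint: a simple union
    frequencies.update(part)
print(frequencies)        # e.g. frozenset({'D', 'K'}) -> 2
```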
67. 3.2.2.2 Partitioning Input Data (1/6)
■ Remark: Partitioning of output data can be performed only if each output can be naturally computed as a function of the input.
■ In many algorithms, it is not possible or desirable to partition the output data.
■ For example:
■ While finding the minimum, maximum, or the sum of a set of numbers, the output is a single unknown value.
■ In a sorting algorithm, the individual elements of the output cannot be efficiently determined in isolation.
68. Partitioning Input Data (2/6)
■ It is sometimes possible to partition the input data, and then use this partitioning to induce concurrency.
■ A task is created for each partition of the input data, and this task performs as much computation as possible using these local data.
■ Solutions to tasks induced by input partitions may not directly solve the original problem.
■ In such cases, a follow-up computation is needed to combine the results.
69. Partitioning Input Data: Example-1
■ Example: finding the sum of N numbers using p processes (N > p):
■ We can partition the input into p subsets of nearly equal sizes.
■ Each task then computes the sum of the numbers in one of the subsets.
■ Finally, the p partial results can be added up to yield the final result (a sketch of this scheme follows).
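A minimal sketch of the scheme above (illustrative code, not from the slides): the input is split into p nearly equal chunks, each process sums its local chunk, and a follow-up step adds the p partial results.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    return sum(chunk)              # each task uses only its local data

if __name__ == "__main__":
    N, p = 100_000, 4
    numbers = list(range(N))
    size = (N + p - 1) // p
    chunks = [numbers[i:i + size] for i in range(0, N, size)]
    with ProcessPoolExecutor(max_workers=p) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(partials)          # follow-up step combining the partial results
    assert total == sum(numbers)
```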
70. Partitioning Input Data: Example-2 (1/3)
■ The problem of computing the frequency of a set of itemsets
■ can also be decomposed based on a partitioning of input data.
■ Fig. shows a decomposition based on a partitioning of the input set of transactions.
71. Partitioning Input Data: Example-2 (2/3)
[Figure: each of the two tasks produces an intermediate result (partial frequencies) that is later combined.]
72. Partitioning Input Data: Example-2 (3/3)
■ Each of the two tasks computes the frequencies of all the itemsets in its respective subset of transactions.
■ The two sets of frequencies, which are the independent outputs of the two tasks, represent intermediate results.
■ Combining the intermediate results by pairwise addition yields the final result.
73. 3.2.2.3 Partitioning both Input and Output Data (1/2)
■ In some cases, in which it is possible to partition the output data, partitioning of input data can offer additional concurrency.
■ Example: consider the 4-way decomposition shown in Fig. for computing itemset frequencies.
■ Both the transaction set and the frequencies are divided into two parts, and a different one of the four possible combinations is assigned to each of the four tasks.
■ Each task then computes a local set of frequencies.
■ Finally, the outputs of Tasks 1 and 3 are added together, as are the outputs of Tasks 2 and 4.
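An illustrative sketch (hypothetical data) of this 4-way decomposition: the transactions and the itemsets are each split in two, every task gets one of the four combinations and computes a local set of frequencies, and the outputs of the two tasks that share an itemset half are added pairwise.

```python
from concurrent.futures import ThreadPoolExecutor

transactions = [{"A", "B"}, {"D", "K"}, {"A", "E"},
                {"C", "D"}, {"D", "K", "A"}, {"B", "C"}]
itemsets = [{"A"}, {"D", "K"}, {"A", "B"}, {"C", "D"}]

t_parts = [transactions[:3], transactions[3:]]   # input halves
i_parts = [itemsets[:2], itemsets[2:]]           # output halves

def local_count(t_part, i_part):
    return {frozenset(s): sum(s <= t for t in t_part) for s in i_part}

with ThreadPoolExecutor(max_workers=4) as pool:
    task = {(ti, ii): pool.submit(local_count, t_parts[ti], i_parts[ii])
            for ti in (0, 1) for ii in (0, 1)}

final = {}
for ii in (0, 1):   # tasks (0, ii) and (1, ii) count the same itemsets
    a, b = task[(0, ii)].result(), task[(1, ii)].result()
    for key in a:
        final[key] = a[key] + b[key]   # pairwise addition of local counts
print(final)
```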
75. 3.2.2.4 Intermediate Data Partitioning (1/8)
■ Partitioning intermediate data can sometimes lead to higher concurrency than partitioning input or output data.
■ Often, the intermediate data are not generated explicitly in the serial algorithm for solving the problem.
■ Some restructuring of the original algorithm may be required to use intermediate data partitioning to induce a decomposition.
76. Intermediate Data Partitioning (2/8)
■ Example: matrix multiplication.
■ Recall that the decompositions induced by a 2 x 2 partitioning of the output matrix C have a maximum degree of concurrency of four.
■ We can increase the degree of concurrency by introducing an intermediate stage in which 8 tasks compute their respective product submatrices and store the results in a temporary 3-D matrix D, as shown in Fig.
■ Submatrix D[k,i,j] is the product of A[i,k] and B[k,j].
77. Intermediate Data Partitioning (3/8): Example
[Figure: the intermediate 3-D matrix D of product submatrices.]
78. Intermediate Data Partitioning (4/8)
■ A partitioning of the intermediate matrix D induces a decomposition into eight tasks.
■ After the multiplication phase, a relatively inexpensive matrix addition step can compute the result matrix C.
■ All submatrices D[*,i,j] with the same second and third dimensions i and j are added to yield C[i,j].
80. Intermediate Data Partitioning (6/8)
■ The eight tasks numbered 1 through 8 in Fig. perform O(n^3/8) work each in multiplying (n/2 x n/2) submatrices of A and B.
■ Then, four tasks numbered 9 through 12 spend O(n^2/4) time each in adding the appropriate (n/2 x n/2) submatrices of the intermediate matrix D to yield the final result matrix C.
■ The second Fig. shows the task dependency graph.
82. Intermediate Data Partitioning: Example (8/8)
The task dependency graph for the decomposition (shown in the previous foil) into 12 tasks is as follows:
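A sketch of this intermediate-data partitioning (assuming numpy; illustrative, not the slides' code): eight tasks compute the product submatrices D[k,i,j] = A[i,k] * B[k,j], and four cheaper follow-up tasks add D[0,i,j] + D[1,i,j] to form each block C[i,j].

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
blk = {0: slice(0, n // 2), 1: slice(n // 2, n)}

def multiply(k, i, j):                         # tasks 1..8
    return A[blk[i], blk[k]] @ B[blk[k], blk[j]]

def add(d0, d1):                               # tasks 9..12
    return d0 + d1

C = np.empty((n, n))
with ThreadPoolExecutor(max_workers=8) as pool:
    D = {(k, i, j): pool.submit(multiply, k, i, j)
         for k in (0, 1) for i in (0, 1) for j in (0, 1)}
    sums = {(i, j): pool.submit(add, D[(0, i, j)].result(), D[(1, i, j)].result())
            for i in (0, 1) for j in (0, 1)}
for (i, j), fut in sums.items():
    C[blk[i], blk[j]] = fut.result()

assert np.allclose(C, A @ B)   # matches the direct product
```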
83. The Owner-Computes Rule (1/2)
■ A decomposition based on partitioning output or input data is also widely referred to as the owner-computes rule.
■ The idea behind this rule is that each partition performs all the computations involving data that it owns.
■ Depending on the nature of the data or the type of data-partitioning, the owner-computes rule may mean different things.
84. The Owner-Computes Rule (2/2)
■ When we assign partitions of the input data to tasks, the owner-computes rule means that a task performs all the computations that can be done using these data.
■ If we partition the output data, then the owner-computes rule means that a task computes all the data in the partition assigned to it.
85. 3.2.3 Exploratory Decomposition (1/10)
■ In many cases, the decomposition of the problem goes hand-in-hand with its execution.
■ These problems typically involve the exploration (search) of a state space of solutions.
■ Problems in this class include a variety of discrete optimization problems (0/1 integer programming, QAP (quadratic assignment problem), etc.), theorem proving, and game playing.
86. Exploratory Decomposition (2/10) Example-1 => 15-puzzle problem
■ Consists of 15 tiles numbered 1 through 15 and one blank tile placed in a 4 x 4 grid.
■ A tile can be moved into the blank position from a position adjacent to it, thus creating a blank in the tile's original position.
■ Four moves are possible: up, down, left, and right.
■ The initial and final configurations of the tiles are specified.
87. Exploratory Decomposition Example-1 => 15-puzzle problem (1/4) (3/10)
■ The objective is to determine any sequence, or a shortest sequence, of moves that transforms the initial configuration to the final configuration.
■ Fig. illustrates sample initial and final configurations and a sequence of moves leading from the initial configuration to the final configuration.
88. Exploratory Decomposition Example-1 => 15-puzzle problem (2/4) (4/10)
■ The puzzle is typically solved using tree-search techniques.
■ From the initial configuration, all possible successor configurations are generated.
■ There may be 2, 3, or 4 possible successor configurations, depending on the occupation of the empty slot by one of its neighbors.
■ The task of finding a path from the initial to the final configuration now translates to finding a path from one of these newly generated configurations to the final configuration.
■ Since one of these newly generated configurations must be closer to the solution by one move, progress has been made towards finding the solution.
89. Exploratory Decomposition Example-1 => 15-puzzle problem (3/4) (5/10)
■ The configuration space generated by the tree search is the state space graph.
■ Each node of the graph is a configuration, and each edge of the graph connects configurations that can be reached from one another by a single move of a tile.
■ One method for solving this problem in parallel (a simplified sketch follows):
■ First, a few levels of configurations starting from the initial configuration are generated serially until the search tree has a sufficient number of leaf nodes.
■ Now each node is assigned to a task to explore further until at least one of them finds a solution.
■ As soon as one of the concurrent tasks finds a solution, it can inform the others to terminate their searches.
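A simplified sketch of this idea (a toy linear search rather than the 15-puzzle; all names and numbers are hypothetical): the search space is partitioned among tasks, and a shared event lets the first task that finds a solution tell the other, unfinished tasks to terminate.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

TARGET = 71_234                      # stand-in for the goal configuration
found = threading.Event()
solutions = []

def explore(chunk):
    for candidate in chunk:
        if found.is_set():           # another task already found a solution
            return                   # terminate this unfinished task early
        if candidate == TARGET:      # stand-in for the goal test
            solutions.append(candidate)
            found.set()              # inform the other tasks
            return

space = range(100_000)
chunks = [range(i, min(i + 25_000, len(space)))
          for i in range(0, len(space), 25_000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(explore, chunks))
print("solution found:", solutions)
```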
91. Exploratory vs. Data Decomposition (7/10)
■ The tasks induced by data decomposition are performed in their entirety, and each task performs useful computations towards the solution of the problem.
■ In exploratory decomposition, unfinished tasks can be terminated as soon as an overall solution is found.
■ The portion of the search space searched (and the aggregate amount of work performed) by a parallel formulation can be different from that searched by a serial algorithm.
■ The work performed by the parallel formulation can be either smaller or greater than that performed by the serial algorithm.
92. Exploratory Decomposition Example-2 (1/3) (8/10)
■ Example: consider a search space that has been partitioned into four concurrent tasks as shown in Fig.
■ If the solution lies right at the beginning of the search space corresponding to task 3 (Fig. (a)), then it will be found almost immediately by the parallel formulation.
■ The serial algorithm would have found the solution only after first searching the parts of the space corresponding to tasks 1 and 2.
93. Exploratory Decomposition Example-2 (2/3) (9/10)
[Figure: the partitioned search space, with the solution near the beginning of task 3's portion in (a) and near the end of task 1's portion in (b).]
94. Exploratory Decomposition Example-2 (3/3) (10/10)
■ On the other hand, if the solution lies towards the end of the search space corresponding to task 1 (Fig. (b)), then the parallel formulation will perform almost four times the work of the serial algorithm and will yield no speedup.
■ This dependence on where the solution lies results in super- or sub-linear speedups.
95. 3.2.4 Speculative Decomposition (1/9)
■ In some applications, dependencies between tasks are not known a priori.
■ For such applications, it is impossible to identify independent tasks.
■ This arises when a program may take one of many possible computationally significant branches depending on the output of other computations that precede it.
96. Speculative Decomposition (2/9)
■ There are generally two approaches to dealing with such applications: conservative approaches, which identify independent tasks only when they are guaranteed to not have dependencies, and optimistic approaches, which schedule tasks even when they may potentially be erroneous.
■ Conservative approaches may yield little concurrency, and optimistic approaches may require roll-back.
97. Speculative Decomposition (3/9)
■ This scenario is similar to evaluating one or more of the branches of a switch statement in C in parallel before the input for the switch is available.
■ While one task is performing the computation that will eventually resolve the switch, other tasks could pick up the multiple branches of the switch in parallel.
■ When the input for the switch has finally been computed, the computation corresponding to the correct branch would be used while that corresponding to the other branches would be discarded.
■ The parallel run time is smaller than the serial run time by the amount of time needed to evaluate the taken branch, since that work is overlapped with the computation that resolves the switch.
98. Speculative Decomposition
■ This parallel formulation of a switch guarantees at least some wasteful computation.
■ In order to minimize the wasted computation, a slightly different formulation of speculative decomposition could be used, especially in situations where one of the outcomes of the switch is more likely than the others.
■ In this case, only the most promising branch is taken up as a task in parallel with the preceding computation.
■ In case the outcome of the switch is different from what was anticipated, the computation is rolled back and the correct branch of the switch is taken.
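A minimal sketch (hypothetical functions and timings, not from the slides) of the basic formulation from the previous slide: all branches are computed speculatively while the switch input is being computed; the refinement described above would submit only the most promising branch.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def compute_condition():
    time.sleep(0.2)                  # long computation the switch depends on
    return "case_b"

def branch_a():
    time.sleep(0.15)
    return "result of branch A"

def branch_b():
    time.sleep(0.15)
    return "result of branch B"

with ThreadPoolExecutor(max_workers=3) as pool:
    cond = pool.submit(compute_condition)
    speculative = {"case_a": pool.submit(branch_a),   # started before the
                   "case_b": pool.submit(branch_b)}   # condition is known
    selected = cond.result()
    answer = speculative[selected].result()           # keep the correct branch
    # the other branch's result is simply discarded (wasted work)
print(answer)
```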
99. Speculative Decomposition: Example-1
■ A classic example of speculative decomposition is in discrete event simulation.
■ The central data structure in a discrete event simulation is a time-ordered event list.
■ Events are extracted precisely in time order, processed, and, if required, resulting events are inserted back into the event list.
100. Speculative Decomposition: Example-1
■ Consider your day today as a discrete event system: you get up, get ready, drive to work, work, eat lunch, work some more, drive back, eat dinner, and sleep.
■ Each of these events may be processed independently; however, in driving to work, you might meet with an unfortunate accident and not get to work at all.
■ Therefore, an optimistic scheduling of other events will have to be rolled back.
101. Speculative Decomposition:
Example-2
■ Another example is the simulation of a network of nodes (for instance, an assembly line or a computer network through which packets pass).
■ The task is to simulate the behavior of this network for various inputs and node delay parameters (note that networks may become unstable for certain values of service rates, queue sizes, etc.).
103. Speculative Decomposition:
Example-2
■ The problem of simulating a sequence of input jobs on the network described in this example appears inherently sequential, because the input of a typical component is the output of another.
■ However, we can define speculative tasks that start simulating a subpart of the network, each assuming one of several possible inputs to that stage.
■ When an actual input to a certain stage becomes available (as a result of the completion of another selector task from a previous stage),
■ then all or part of the work required to simulate this input would have already been finished if the speculation was correct,
■ or the simulation of this stage is restarted with the most recent correct input if the speculation was incorrect.
104. Speculative vs. exploratory
decomposition
■ In exploratory decomposition:
■ the output of the multiple tasks originating at a branch is unknown.
■ the serial algorithm may explore different alternatives one after the other, because the branch that may lead to the solution is not known beforehand.
■ => the parallel program may perform more, less, or the same amount of aggregate work compared to the serial algorithm, depending on the location of the solution in the search space.
■ In speculative decomposition:
■ the input at a branch leading to multiple parallel tasks is unknown.
■ the serial algorithm would strictly perform only one of the tasks at a speculative stage, because when it reaches the beginning of that stage, it knows exactly which branch to take.
■ => a parallel program employing speculative decomposition performs more aggregate work than its serial counterpart.
105. 3.2.5 Hybrid Decompositions
■ Decomposition techniques are not exclusive and can often be combined.
■ Often, a computation is structured into multiple
stages and it is sometimes necessary to apply
different types of decomposition in different stages.
106. Hybrid Decompositions
Example 1: Finding the minimum.
■ Example 1: when finding the minimum of a large set of n numbers,
■ a purely recursive decomposition may result in far more tasks than the number of processes, P, available.
■ An efficient decomposition would instead partition the input into P roughly equal parts and have each task compute the minimum of the sequence assigned to it; the P partial results are then combined (e.g., recursively), as sketched below.
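A minimal sketch of this hybrid decomposition (data decomposition into P parts followed by a combination of the partial minima); the chunking scheme and process count are illustrative assumptions.

from multiprocessing import Pool

def chunk_min(chunk):
    # Each task computes the minimum of its own partition of the input.
    return min(chunk)

def parallel_min(data, P=4):
    size = (len(data) + P - 1) // P
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(P) as pool:
        partial = pool.map(chunk_min, chunks)   # data decomposition: P concurrent tasks
    return min(partial)                         # combine the P partial results

if __name__ == "__main__":
    print(parallel_min([9, 4, 7, 1, 8, 2, 6, 3]))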
107. Hybrid Decompositions - Finding the minimum.
108. Hybrid Decompositions-
Example 2: Quicksort in parallel.
■ Recursive decomposition has been used for quicksort.
■ This formulation results in O(n) tasks for the problem of sorting a sequence of size n.
■ But due to the dependencies among these tasks and due to the uneven sizes of the tasks, the effective concurrency is quite limited.
■ For example, the first task for splitting the input list into two parts takes O(n) time, which puts an upper limit on the performance gain possible via parallelization.
■ The step of splitting lists performed by tasks in parallel quicksort can also be decomposed using the input decomposition technique.
■ The resulting hybrid decomposition that combines recursive decomposition with input data decomposition leads to a highly concurrent formulation of quicksort (see the sketch below).
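A rough sketch of the recursive-decomposition part of parallel quicksort; the partition step is shown serially, with a note on where the input data decomposition would come in. Process count and test data are illustrative.

from multiprocessing import Pool

def quicksort(seq):
    # Plain recursive quicksort, run inside each task.
    if len(seq) <= 1:
        return list(seq)
    pivot = seq[0]
    lo = [x for x in seq[1:] if x < pivot]
    hi = [x for x in seq[1:] if x >= pivot]
    return quicksort(lo) + [pivot] + quicksort(hi)

def parallel_quicksort(seq, P=2):
    # Recursive decomposition: the first split produces two independent
    # sub-sorts that run as concurrent tasks.  The O(n) split itself is done
    # serially here; the hybrid formulation would decompose this step across
    # tasks as well, using input data decomposition.
    if len(seq) <= 1:
        return list(seq)
    pivot = seq[0]
    lo = [x for x in seq[1:] if x < pivot]
    hi = [x for x in seq[1:] if x >= pivot]
    with Pool(P) as pool:
        sorted_lo, sorted_hi = pool.map(quicksort, [lo, hi])
    return sorted_lo + [pivot] + sorted_hi

if __name__ == "__main__":
    print(parallel_quicksort([5, 3, 8, 1, 9, 2, 7]))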
113. Input Data Decomposition
• Generally applicable if each output can be naturally computed as a function of the input.
• In many cases, this is the only natural decomposition because the output is not clearly known a-priori (e.g., the problem of finding the minimum in a list, sorting a given list, etc.).
• A task is associated with each input data partition. The task performs as much of the computation with its part of the data as possible. Subsequent processing combines these partial results.
114. Exploratory Decomposition
• In many cases, the decomposition of the problem goes hand-in-hand with its execution.
• These problems typically involve the exploration (search) of a state space of solutions.
• Problems in this class include a variety of discrete optimization problems (0/1 integer programming, QAP, etc.), theorem proving, game playing, etc.
117. Speculative Decomposition: Example
Another example is the simulation of a network of nodes
(for instance, an assembly line or a computer network
through which packets pass). The task is to simulate the
behavior of this network for various inputs and node
delay parameters (note that networks may become
unstable for certain values of service rates, queue sizes,
etc.).
118. Hybrid Decompositions
Often, a mix of decomposition techniques is necessary for
decomposing a problem. Consider the following examples:
• In quicksort, recursive decomposition alone limits concurrency (Why?). A
mix of data and recursive decompositions is more desirable.
• In discrete event simulation, there might be concurrency in task processing.
A mix of speculative decomposition and data decomposition may work well.
• Even for simple problems like finding a minimum of a list of numbers, a mix
of data and recursive decomposition works well.
120. Characteristics of Tasks and Interactions
• We shall discuss the various properties of tasks and inter-
task interactions that affect the choice of a good mapping.
• Characteristics of Tasks
• Task Generation
• Task Sizes
• Knowledge of Task Sizes
• Size of Data Associated with Tasks
148. Task Generation
• Static task generation: Concurrent tasks can be
identified a-priori. Typical matrix operations, graph
algorithms, image processing applications, and other
regularly structured problems fall in this class. These can
typically be decomposed using data or recursive
decomposition techniques.
• Dynamic task generation: Tasks are generated as we
perform computation. A classic example of this is in
game playing - each 15 puzzle board is generated from
the previous one. These applications are typically
decomposed using exploratory or speculative
decompositions.
152. Task Sizes
• Task sizes may be uniform (i.e., all tasks are the same
size) or non-uniform.
• Non-uniform task sizes may be such that they can be
determined (or estimated) a-priori or not.
• Examples in this class include discrete optimization
problems, in which it is difficult to estimate the effective
size of a state space.
153. Size of Data Associated with Tasks
• The size of data associated with a task may be small or
large when viewed in the context of the size of the task.
• A small context of a task implies that an algorithm can
easily communicate this task to other processes
dynamically (e.g., the 15 puzzle).
• A large context ties the task to a process, or alternately, an algorithm may attempt to reconstruct the context at another process as opposed to communicating the context of the task (e.g., 0/1 integer programming).
154. Characteristics of Task Interactions
• Tasks may communicate with each other in various
ways. The associated dichotomy is:
• Static interactions: The tasks and their interactions are
known a-priori. These are relatively simpler to code into
programs.
• Dynamic interactions: The timing of interactions or the set of interacting tasks cannot be determined a-priori. These interactions are harder to code, especially, as we shall see, using message passing APIs.
155. Characteristics of Task Interactions
• Regular interactions: There is a definite pattern (in the
graph sense) to the interactions. These patterns can be
exploited for efficient implementation.
• Irregular interactions: Interactions lack well-defined
topologies.
156. Characteristics of Task Interactions: Example
A simple example of a regular static interaction pattern is in image
dithering. The underlying communication pattern is a structured (2-D
mesh) one as shown here:
157. Characteristics of Task Interactions: Example
The multiplication of a sparse matrix with a vector is a good example
of a static irregular interaction pattern. Here is an example of a
sparse matrix and its associated interaction pattern.
158. Characteristics of Task Interactions
• Interactions may be read-only or read-write.
• In read-only interactions, tasks just read data items
associated with other tasks.
• In read-write interactions, tasks read as well as modify data items associated with other tasks.
• In general, read-write interactions are harder to code,
since they require additional synchronization primitives.
159. Characteristics of Task Interactions
• Interactions may be one-way or two-way.
• A one-way interaction can be initiated and accomplished
by one of the two interacting tasks.
• A two-way interaction requires participation from both
tasks involved in an interaction.
• One-way interactions are somewhat harder to code in message passing APIs.
160. Mapping Techniques
• Once a problem has been decomposed into concurrent
tasks, these must be mapped to processes (that can be
executed on a parallel platform).
• Mappings must minimize overheads.
• Primary overheads are communication and idling.
• Minimizing these overheads often represents conflicting objectives.
• Assigning all work to one processor trivially minimizes
communication at the expense of significant idling.
161. Mapping Techniques for Minimum Idling
Mapping must simultaneously minimize idling and balance the load.
Merely balancing load does not minimize idling.
162. Mapping Techniques for Minimum Idling
Mapping techniques can be static or dynamic.
• Static Mapping: Tasks are mapped to processes a-priori.
For this to work, we must have a good estimate of the
size of each task. Even in these cases, the problem may
be NP complete.
• Dynamic Mapping: Tasks are mapped to processes at runtime. This may be because the tasks are generated at runtime, or because their sizes are not known a-priori.
163. Schemes for Static Mapping
• Mappings based on data partitioning.
• Mappings based on task graph partitioning.
• Hybrid mappings.
164. Mappings Based on Data Partitioning
We can combine data partitioning with the "owner-computes" rule to partition the computation into subtasks. The simplest data decomposition schemes for dense matrices are 1-D block distribution schemes (see the sketch below).
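A tiny sketch of a 1-D block distribution combined with the owner-computes rule: each process computes exactly the rows of the output it owns. The block-range formula and the matrix-vector product are illustrative assumptions, and the loop over ranks stands in for the parallel processes.

import numpy as np

def block_range(n, p, rank):
    # Rows owned by process `rank` under a 1-D block distribution of n rows over p processes.
    size = (n + p - 1) // p
    return rank * size, min((rank + 1) * size, n)

def owner_computes_matvec(A, x, p):
    n = A.shape[0]
    y = np.zeros(n)
    for rank in range(p):                    # stand-in for p concurrent processes
        lo, hi = block_range(n, p, rank)
        y[lo:hi] = A[lo:hi, :] @ x           # each owner computes only its own block of y
    return y

A = np.arange(16.0).reshape(4, 4)
x = np.ones(4)
print(owner_computes_matvec(A, x, p=2))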
165. Block Array Distribution Schemes
Block distribution schemes can be generalized to
higher dimensions as well.
166. Block Array Distribution Schemes: Examples
• For multiplying two dense matrices A and B, we can
partition the output matrix C using a block
decomposition.
• For load balance, we give each task the same number of
elements of C. (Note that each element of C
corresponds to a single dot product.)
• The choice of precise decomposition (1-D or 2-D) is
determined by the associated communication overhead.
• In general, a higher-dimensional decomposition allows the use of a larger number of processes.
168. Cyclic and Block Cyclic Distributions
• If the amount of computation associated with data items
varies, a block decomposition may lead to significant
load imbalances.
• A simple example of this is in LU decomposition (or
Gaussian Elimination) of dense matrices.
169. LU Factorization of a Dense Matrix
A decomposition of LU factorization into 14 tasks - notice the
significant load imbalance.
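The task list in the original figure did not survive extraction. Assuming the standard 3 x 3 block formulation of LU factorization that this example follows (A partitioned into blocks A1,1 ... A3,3 with A = LU), the 14 tasks are:

1:  A1,1 -> L1,1 U1,1   (factor the first diagonal block)
2:  L2,1 = A2,1 U1,1^-1
3:  L3,1 = A3,1 U1,1^-1
4:  U1,2 = L1,1^-1 A1,2
5:  U1,3 = L1,1^-1 A1,3
6:  A2,2 = A2,2 - L2,1 U1,2
7:  A3,2 = A3,2 - L3,1 U1,2
8:  A2,3 = A2,3 - L2,1 U1,3
9:  A3,3 = A3,3 - L3,1 U1,3
10: A2,2 -> L2,2 U2,2   (factor the second diagonal block)
11: L3,2 = A3,2 U2,2^-1
12: U2,3 = L2,2^-1 A2,3
13: A3,3 = A3,3 - L3,2 U2,3
14: A3,3 -> L3,3 U3,3   (factor the last diagonal block)

The imbalance is visible: the block factorizations, triangular solves, and updates do different amounts of work, and later tasks depend on earlier ones.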
170. Block Cyclic Distributions
• Variation of the block distribution scheme that can be used to
alleviate the load-imbalance and idling problems.
• Partition an array into many more blocks than the number of
available processes.
• Blocks are assigned to processes in a round-robin manner so that
each process gets several non-adjacent blocks.
171. Block-Cyclic Distribution for Gaussian
Elimination
The active part of the matrix in Gaussian Elimination changes.
By assigning blocks in a block-cyclic fashion, each processor
receives blocks from different parts of the matrix.
173. Block-Cyclic Distribution
• A cyclic distribution is a special case in which block size is one.
• A block distribution is a special case in which block size is n/p, where n is the dimension of the matrix and p is the number of processes (see the sketch below).
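A short sketch of how rows map to processes under these schemes: block indices are dealt to processes round-robin, and the cyclic and block distributions fall out as the special cases just described.

def block_cyclic_owner(i, n, p, block_size):
    # Process that owns row (or column) i when n rows are grouped into
    # blocks of `block_size` and blocks are assigned to p processes round-robin.
    return (i // block_size) % p

n, p = 8, 2
print([block_cyclic_owner(i, n, p, block_size=2) for i in range(n)])       # block-cyclic: [0, 0, 1, 1, 0, 0, 1, 1]
print([block_cyclic_owner(i, n, p, block_size=1) for i in range(n)])       # cyclic (block size 1): [0, 1, 0, 1, 0, 1, 0, 1]
print([block_cyclic_owner(i, n, p, block_size=n // p) for i in range(n)])  # block (block size n/p): [0, 0, 0, 0, 1, 1, 1, 1]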
174. Graph Partitioning Based Data Decomposition
• In case of sparse matrices, block decompositions are
more complex.
• Consider the problem of multiplying a sparse matrix with
a vector.
• The graph of the matrix is a useful indicator of the work
(number of nodes) and communication (the degree of
each node).
• In this case, we would like to partition the graph so as to assign an equal number of nodes to each process, while minimizing the edge count of the graph partition (the quantity sketched below).
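For concreteness, a minimal sketch of the quantity being minimized: given a partition assignment, the edge-cut is the number of graph edges whose endpoints land on different processes. The tiny graph and partitions below are made-up examples.

def edge_cut(edges, part):
    # Number of edges whose endpoints are assigned to different processes.
    return sum(1 for u, v in edges if part[u] != part[v])

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]   # a small example graph
part_a = {0: 0, 1: 0, 2: 1, 3: 1}                  # balanced partition
part_b = {0: 0, 1: 1, 2: 0, 3: 1}                  # also balanced, but a worse edge-cut
print(edge_cut(edges, part_a), edge_cut(edges, part_b))   # 3 4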
175. Partitioning the Graph of Lake Superior
Random Partitioning
Partitioning for minimum edge-cut.
176. Mappings Based on Task Partitioning
• Partitioning a given task-dependency graph across
processes.
• Determining an optimal mapping for a general task-
dependency graph is an NP-complete problem.
• Excellent heuristics exist for structured graphs.
177. Task Partitioning: Mapping a Binary Tree
Dependency Graph
Example illustrates the dependency graph of one view of quicksort
and how it can be assigned to processes in a hypercube.
178. Task Partitioning: Mapping a Sparse Graph
Sparse graph for computing a sparse matrix-vector product and
its mapping.
179. Hierarchical Mappings
• Sometimes a single mapping technique is inadequate.
• For example, the task mapping of the binary tree
(quicksort) cannot use a large number of processors.
• For this reason, task mapping can be used at the top
level and data partitioning within each level.
180. An example of task partitioning at top level with data
partitioning at the lower level.
181. Schemes for Dynamic Mapping
• Dynamic mapping is sometimes also referred to as
dynamic load balancing, since load balancing is the
primary motivation for dynamic mapping.
• Dynamic mapping schemes can be centralized or
distributed.
182. Centralized Dynamic Mapping
• Processes are designated as masters or slaves.
• When a process runs out of work, it requests the master
for more work.
• When the number of processes increases, the master
may become the bottleneck.
• To alleviate this, a process may pick up a number of tasks (a chunk) at one time. This is called chunk scheduling (see the sketch below).
• Selecting large chunk sizes may lead to significant load
imbalances as well.
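A minimal sketch of centralized dynamic mapping with chunk scheduling, using a shared queue as a stand-in for the master: idle workers repeatedly grab a chunk of tasks. The task, chunk size, and worker count are illustrative.

import queue
import threading

def worker(task_queue, chunk_size, results):
    while True:
        chunk = []
        try:
            for _ in range(chunk_size):          # request a chunk of tasks from the master
                chunk.append(task_queue.get_nowait())
        except queue.Empty:
            pass
        if not chunk:
            return                               # no work left
        for task in chunk:
            results.append(task * task)          # stand-in for processing a task

task_queue = queue.Queue()
for t in range(20):
    task_queue.put(t)

results = []
threads = [threading.Thread(target=worker, args=(task_queue, 3, results)) for _ in range(4)]
for th in threads: th.start()
for th in threads: th.join()
print(sorted(results))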
183. Distributed Dynamic Mapping
• Each process can send or receive work from other
processes.
• This alleviates the bottleneck in centralized schemes.
• There are four critical questions: how are sending and receiving processes paired together, who initiates work transfer, how much work is transferred, and when is a transfer triggered?
• Answers to these questions are generally application
specific. We will look at some of these techniques later in
this class.
184. Minimizing Interaction Overheads
• Maximize data locality: Where possible, reuse
intermediate data. Restructure computation so that data
can be reused in smaller time windows.
• Minimize volume of data exchange: There is a cost
associated with each word that is communicated. For
this reason, we must minimize the volume of data
communicated.
• Minimize frequency of interactions: There is a startup
cost associated with each interaction. Therefore, try to
merge multiple interactions to one, where possible.
185. Minimizing Interaction Overheads (continued)
• Overlapping computations with interactions: Use non-
blocking communications, multithreading, and
prefetching to hide latencies.
• Replicating data or computations.
• Using group communications instead of point-to-point
primitives.
• Overlap interactions with other interactions.
186. Parallel Algorithm Models
An algorithm model is a way of structuring a parallel
algorithm by selecting a decomposition and mapping
technique and applying the appropriate strategy to
minimize interactions.
• Data Parallel Model: Tasks are statically (or semi-
statically) mapped to processes and each task performs
similar operations on different data.
• Task Graph Model: Starting from a task dependency graph, the interrelationships among the tasks are utilized to promote locality or to reduce interaction costs.
187. Parallel Algorithm Models (continued)
• Master-Slave Model: One or more processes generate
work and allocate it to worker processes. This allocation
may be static or dynamic.
• Pipeline / Producer-Consumer Model: A stream of data is passed through a succession of processes, each of which performs some task on it.
• Hybrid Models: A hybrid model may be composed either
of multiple models applied hierarchically or multiple
models applied sequentially to different phases of a
parallel algorithm.