The document discusses parallel processing and matrix multiplication. It introduces parallel processing concepts such as dividing a task among multiple processors to complete it faster. As an example, it explains how two people can add 100 numbers in roughly half the time it takes one person. It then discusses using parallel processing to compute the convex hull of a set of points by dividing the set in half and merging the results. The rest of the document focuses on computational models for parallel processing, such as the PRAM, the different types of PRAM models (EREW, CREW, CRCW), and how they handle read/write conflicts. It also provides an example of using parallel processing to perform matrix multiplication faster by dividing the matrices and merging the results.
1. Report on
Matrix Multiplication using Parallel Processing
Subject
Advanced Computer Architecture
Submitted To
Mr. Asim Munir
Class
MSCS-F14
Prepared by: Sunawar Khan
Reg No: 813-MSCS-F14
2. Matrix Multiplication using Parallel Processing
Parallel Processing
Introduction to parallel processing
There are many applications in day-to-day life that demand real-time solutions to problems. For
example, weather forecasting has to be done in a timely fashion, and if an expert system is used to
aid a physician in surgical procedures, decisions have to be made within seconds.
Programs written for such applications have to perform an enormous amount of computation.
Even the fastest single-processor machine may not be able to come up with solutions within
tolerable time limits. Parallel Random Access Machines (PRAM) offer the potential of
decreasing the solution times enormously. For example:
Say there are 100 numbers to be added and there are two persons, A and B. Person A can add the
first 50 numbers. At the same time, B can add the next 50 numbers. When they are done, one of
them can add the two individual sums to get the final answer. So two people can add the 100
numbers in almost half the time required by one.
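As a rough illustration (not part of the original report), the same idea can be sketched in Python: the two "persons" become two worker threads, and the final combination is a single extra addition. The function name is chosen for the example only, and because of CPython's global interpreter lock this shows the structure of the computation rather than a real speed-up.

# Minimal sketch: split 100 numbers between two workers, sum each half
# concurrently, then combine the two partial sums with one final addition.
from concurrent.futures import ThreadPoolExecutor

def parallel_sum_two_workers(numbers):
    mid = len(numbers) // 2
    halves = [numbers[:mid], numbers[mid:]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        partial_sums = list(pool.map(sum, halves))   # persons A and B work at the same time
    return partial_sums[0] + partial_sums[1]          # one of them adds the two partial sums

print(parallel_sum_two_workers(list(range(1, 101))))  # 5050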
Here is another example: computing the convex hull of a set of points using parallel processing.
3. Take the set of points and divide it into two halves. Assume that a recursive call computes the
convex hull of each half. Conquer stage: take the two convex hulls and merge them to obtain
the convex hull of the entire set.
What is a Computational Model?
“A computational model is a mathematical model in computational science that requires
extensive computational resources to study the behavior of a complex system by computer
simulation.”
Random access machine model
Algorithms can be measured in a machine-independent way using the Random Access Machine
(RAM) model. This model assumes a single processor. In the RAM model, instructions are
executed one after the other, with no concurrent operations. This model of computation is an
abstraction that allows us to compare algorithms on the basis of performance. The assumptions
made in the RAM model to accomplish this are:
Each simple operation takes 1 time step.
Loops and subroutines are not simple operations.
Each memory access takes one time step, and there is no shortage of memory.
For any given problem, the running time of an algorithm is assumed to be the number of time steps.
The space used by an algorithm is assumed to be the number of RAM memory cells.
A model of computation whose memory consists of an unbounded sequence of registers, each of
which may hold an integer. In this model, arithmetic operations are allowed to compute the
address of a memory register.
In the RAM (Random Access Machine) we assume that any of the following operations can be
done in one unit of time: addition, subtraction, multiplication, division, comparison, memory
access, assignment, and so on. This model is widely accepted as a valid sequential model.
An important feature of parallel computing that is absent in sequential computing is the need for
interprocessor communication. For example, given any problem, the processors have to
communicate among themselves and agree on the subproblems each will work on. They also
need to communicate to see whether every processor has finished its task, and so on.
4. RAM Model for a Single-Processor Machine
The designer of a sequential algorithm typically formulates the algorithm using an abstract model
of computation called the random access machine (RAM). In this model, the machine consists of a
single processor connected to a memory system. Each arithmetic operation, logical operation, and
memory access requires one time step.
Standard Random Access Machine
Each operation (load, store, jump, add, etc.) takes one unit of time.
A simple, general model.
The basic model for sequential algorithms.
Unbounded number of local memory cells.
Each memory cell can hold an integer of unbounded size.
The instruction set includes simple operations: data operations, comparisons, branches.
All operations take unit time.
Time complexity = number of instructions executed.
Space complexity = number of memory cells used.
Parallel Machine or Multiprocessor Model
We begin this discussion with an ideal parallel machine called the Parallel Random Access Machine,
or PRAM. A multiprocessor model is a generalization of the sequential RAM model in which
there is more than one processor. Multiprocessor models can be classified into three basic types:
local memory machines, modular memory machines, and the Parallel Random Access Machine.
Each is described below.
5. “The Parallel Random Access Machine (PRAM) is an abstract model for parallel computation
which assumes that all the processors operate synchronously under a single clock and are able to
randomly access a large shared memory.”
A natural extension of the Random Access Machine (RAM) serial architecture is the
Parallel Random Access Machine, or PRAM.
A PRAM consists of p processors and a global memory of unbounded size that is uniformly
accessible to all processors.
Processors share a common clock but may execute different instructions in each cycle.
PRAM Architecture
The parallel version of RAM constitutes an abstract model of the class of global-memory parallel
processors. The
abstraction consists of ignoring the details of the processor-to-memory interconnection network
6. and taking the view that each processor can access any memory location in each machine cycle,
independent of what other processors are doing.
Processor i can do the following in the three phases of one machine cycle:
1. Fetch a value from address si in shared memory
2. Perform computations on data held in local registers
3. Store a value into address di in shared memory
The PRAM is an abstract machine for designing algorithms applicable to parallel computers.
Unbounded collection of RAM processors P0, P1, …
Processors do not have tapes.
Each processor has an unbounded number of registers.
Unbounded collection of shared memory cells.
All processors can access all memory cells in unit time.
All communication is via shared memory.
7. Two or more processors may read simultaneously from the same cell
A write conflict occurs when two or more processors try to write simultaneously into
the same cell
Classification of PRAM Model
This classification is reminiscent of Flynn's classification and offers yet another example of
the quest to invent four-letter abbreviations/acronyms in computer architecture! Note that here,
too, one of the categories is not very useful, because if concurrent writes are allowed, there is no
logical reason for excluding the less problematic concurrent reads. EREW PRAM is the most
realistic of the four submodels (to the extent that thousands of processors concurrently accessing
thousands of memory locations within a shared memory address space of millions or even
billions of locations can be considered realistic!). CRCW PRAM is the least restrictive
submodel, but has the disadvantage of requiring a conflict resolution mechanism to define the
effect of concurrent writes (more on this below). The default submodel, which is assumed when
nothing is said about the submodel, is CREW PRAM. For most computations, it is fairly easy to
organize the algorithm steps so that concurrent writes to the same location are never attempted.
CRCW PRAM is further classified according to how concurrent writes are handled. Here are a
few example submodels based on the semantics of concurrent writes in CRCW PRAM:
The four submodels are formed by combining the two options for reads from the same location
(exclusive or concurrent) with the two options for writes to the same location (exclusive or concurrent):
EREW (exclusive read, exclusive write): least "powerful", most "realistic"
CREW (concurrent read, exclusive write): the default submodel
ERCW (exclusive read, concurrent write): not useful
CRCW (concurrent read, concurrent write): most "powerful", further subdivided
8. CRCW Submodels
• Undefined: In case of multiple writes, the value written is undefined (CRCW-U).
• Detecting: A special code representing "detected collision" is written (CRCW-D).
• Common: Multiple writes are allowed only if all store the same value (CRCW-C). This is
sometimes called the consistent-write submodel.
• Random: The value written is randomly chosen from among those offered (CRCW-R).
This is sometimes called the arbitrary-write submodel.
• Priority: The processor with the lowest index succeeds in writing its value (CRCW-P).
• Max/Min: The largest/smallest of the multiple values is written (CRCW-M).
• Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR
(CRCW-X), or some other combination of the multiple values is written.
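To make the differences among the submodels listed above concrete, here is a small illustrative Python sketch (not from the report). It resolves a set of simultaneous writes to one cell under a few of the rules; the function name and the representation of a write as a (processor_id, value) pair are assumptions made only for this example.

# Sketch: resolve simultaneous writes to one memory cell under several CRCW submodels.
# Each write is a (processor_id, value) pair.
def resolve_concurrent_writes(writes, submodel):
    values = [v for _, v in writes]
    if submodel == "common":      # CRCW-C: allowed only if all writers store the same value
        assert len(set(values)) == 1, "inconsistent concurrent write"
        return values[0]
    if submodel == "priority":    # CRCW-P: the lowest-indexed processor wins
        return min(writes)[1]
    if submodel == "max":         # CRCW-M: the largest value is written
        return max(values)
    if submodel == "sum":         # CRCW-S: reduction by arithmetic sum
        return sum(values)
    raise ValueError("unknown submodel")

writes = [(3, 7), (0, 5), (2, 7)]
print(resolve_concurrent_writes(writes, "priority"))  # 5 (processor 0 wins)
print(resolve_concurrent_writes(writes, "max"))       # 7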
Physical Complexity
Processors and memories are connected via switches.
Since these switches must operate in O(1) time at the level of words, for a system of p
processors and m memory words the switch complexity is O(mp).
Clearly, for meaningful values of p and m, a true PRAM is not realizable.
9. Relationship between different PRAM Submodel
The following relationships have been established between some of the PRAM submodels:
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
Even though all CRCW submodels are strictly more powerful than the EREW submodel, the
latter can simulate the most powerful CRCW submodel listed above with at most logarithmic
slowdown.
A p-processor CRCW-P (priority) PRAM can be simulated by a p-processor EREW PRAM with
a slowdown factor of Θ(log p). This is possible because an EREW PRAM can sort, or find the
smallest of, p values in Θ(log p) time, as we shall see later. To avoid concurrent writes, each
processor writes an ID-address-value triple into its corresponding element of a scratch list of
size p. The p processors then cooperate to sort the list by destination address, partition it into
segments corresponding to common addresses (which are now adjacent in the sorted list), do a
reduction within each segment to remove all writes to the same location except the one with the
smallest processor ID, and finally write the surviving address-value pairs into memory. This final
write operation is clearly of the exclusive variety.
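The sorting-based simulation described above can be sketched sequentially as follows. This is an illustrative Python rendering, not the report's own code; on a real EREW PRAM, the sort and the per-segment reduction would each run in Θ(log p) parallel time.

# Sketch of a CRCW-P (priority) write step simulated without write conflicts:
# 1) each processor records an (address, processor_id, value) triple,
# 2) the triples are sorted by address (ties broken by processor id),
# 3) within each run of equal addresses only the smallest-id write survives.
def simulate_priority_writes(triples, memory):
    triples = sorted(triples)                     # sort by (address, processor_id, value)
    for i, (addr, pid, value) in enumerate(triples):
        if i == 0 or triples[i - 1][0] != addr:   # first (lowest-id) writer to this address
            memory[addr] = value
    return memory

memory = {}
triples = [(5, 2, 'b'), (5, 0, 'a'), (9, 1, 'c')]
print(simulate_priority_writes(triples, memory))  # {5: 'a', 9: 'c'}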
Matrix Multiplication
Introduction to Matrix Multiplication.
We will show how to implement matrix multiplication C=C+A*B on several of the
communication networks discussed in the last lecture, and develop performance models to
predict how long they take. We will see that the performance depends on several factors.
We discuss PRAM matrix multiplication algorithms as representative examples of the class of
numerical problems. Matrix multiplication is quite important in its own right and is also used as
a building block in many other parallel algorithms. For example, matrix multiplication is useful
in solving graph problems when the graphs are represented by their adjacency or weight
matrices. Given m × m matrices A and B, with elements aij and bij, their product C is defined by
cij = ∑k=0..m–1 aik bkj.
10. The following O(m³)-step sequential algorithm can be used for multiplying m × m matrices:
Sequential matrix multiplication
for i = 0 to m – 1 do
for j = 0 to m – 1 do
t := 0
for k = 0 to m – 1 do
t := t + aikbkj
endfor
cij := t
endfor
endfor
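For reference, a direct (and deliberately naive) Python transcription of this O(m³) pseudocode might look as follows; it assumes A and B are given as m × m lists of lists.

# Naive sequential m x m matrix multiplication, mirroring the pseudocode above.
def matmul_sequential(A, B):
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            t = 0
            for k in range(m):
                t += A[i][k] * B[k][j]
            C[i][j] = t
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_sequential(A, B))  # [[19, 22], [43, 50]]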
Consider n × n matrix multiplication with n³ processors.
Each cij = ∑k=1..n aik bkj can be computed on the CREW PRAM in parallel, using n processors, in
O(log n) time.
11. On the EREW PRAM, the exclusive reads of the aik and bkj values can be satisfied by making n
copies of A and B, which takes O(log n) time with n processors.
The total time is still O(log n).
The memory requirement is of course much higher for the EREW PRAM.
Complexity: Θ(n³)
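Conceptually, the n³-processor CREW algorithm has two phases: every processor (i, j, k) computes one product aik·bkj in one step, and then, for each (i, j), n processors add their n products in a ⌈log₂ n⌉-round balanced tree. The following Python sketch (an illustration, not the report's code) mimics that round structure sequentially.

# Sketch of the CREW PRAM algorithm with n^3 (virtual) processors:
# phase 1: one multiplication per (i, j, k); phase 2: log n rounds of pairwise addition.
import math

def matmul_crew_pram(A, B):
    n = len(A)
    # Phase 1: products[i][j][k] = a_ik * b_kj (all n^3 products done "in parallel", one round).
    products = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)] for i in range(n)]
    # Phase 2: ceil(log2 n) rounds; in the round with stride s, position k absorbs position k+s.
    rounds = math.ceil(math.log2(n)) if n > 1 else 0
    for r in range(rounds):
        s = 1 << r
        for i in range(n):
            for j in range(n):
                row = products[i][j]
                for k in range(0, n, 2 * s):      # these additions happen concurrently
                    if k + s < n:
                        row[k] += row[k + s]
    return [[products[i][j][0] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_crew_pram(A, B))  # [[19, 22], [43, 50]]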
Better algorithms that improve on this slightly:
Multiplication by blocks
PRAM matrix multiplication using m² processors
Proc (i, j), 0 ≤ i, j < m, do
begin
t := 0
for k = 0 to m – 1 do
t := t + aikbkj
endfor
cij := t
end
Because multiple processors will be reading the same row of A or the same column of B, the
above naive implementation of the algorithm would require the CREW submodel. However, it is
possible to convert the algorithm to an EREW PRAM algorithm by skewing the memory
accesses (how?).
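One common way to do this skewing (a sketch of a standard technique, not taken verbatim from the report) is to have processor (i, j) start its k-loop at k = (i + j) mod m. In any given synchronous step, no two processors then read the same element of A or of B, so every read is exclusive.

# EREW-friendly variant of the m^2-processor algorithm: processor (i, j)
# visits k in the order (i+j) mod m, (i+j+1) mod m, ..., so that in each
# synchronous step no two processors read the same a_ik or b_kj.
def matmul_erew_skewed(A, B):
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for i in range(m):                   # conceptually, all (i, j) run in parallel
        for j in range(m):
            t = 0
            for step in range(m):
                k = (i + j + step) % m   # the skewed access pattern
                t += A[i][k] * B[k][j]
            C[i][j] = t
    return C

print(matmul_erew_skewed([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]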
Using only m processors, matrix multiplication can be done in Θ(m²) time by having processor i
compute the m elements in row i of the product matrix C in turn.
PRAM Matrix Multiplication with m Processors
for j = 0 to m – 1 Proc i, 0 ≤ i < m, do
t := 0
for k = 0 to m – 1 do
t := t + aikbkj
12. endfor
cij := t
endfor
The m processors read A at once, but each reads a different row, so no concurrent read of A is needed
- All m processors read the same column of B at the same time (concurrent read must be allowed)
- If concurrent reads are not allowed, the algorithm can be converted from CREW to EREW by
skewing the memory accesses, as discussed above
More Efficient Matrix Multiplication (for NUMA)
On the Cm* NUMA-type shared-memory multiprocessor, a research prototype machine built at
Carnegie-Mellon University in the 1980s, this block matrix multiplication algorithm exhibited
good, but sublinear, speed-up. With 16 processors, the speed-up was only 5 in multiplying 24 ×
24 matrices. However, the speed-up improved to about 9 when larger 36 × 36 (48 × 48) matrices
were multiplied. It is interesting to note that the improved locality of the block matrix multiplication
algorithm can also improve the running time on a uniprocessor, or on a distributed shared-memory
multiprocessor with caches, owing to higher cache hit rates.
13. Detail of Block Matrix Multiplication
A multiply-add computation on q × q blocks needs 2q² = 2m²/p memory accesses and 2q³
arithmetic operations, so q arithmetic operations are done per memory access.
[Figure: a multiply-add step in block matrix multiplication. An element of block (i, k) in matrix A
(row indices iq … iq + q – 1) is multiplied by an element of block (k, j) in matrix B (column indices
jq … jq + q – 1), and the result is added into the corresponding element of block (i, j) in matrix C.]
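A sequential Python sketch of the block scheme follows (illustrative only; it assumes the block size q divides m). On a p-processor machine, each processor would own one q × q block of C, with q = m/√p, and perform exactly these multiply-add steps on its block.

# Blocked matrix multiplication: C is updated one q x q block at a time.
# Assumes q divides m; on p processors one would take q = m / sqrt(p).
def matmul_blocked(A, B, q):
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for i in range(0, m, q):                  # block row of C (owned by one processor)
        for j in range(0, m, q):              # block column of C
            for k in range(0, m, q):          # bring in block (i,k) of A and block (k,j) of B
                for a in range(i, i + q):
                    for b in range(j, j + q):
                        t = C[a][b]
                        for c in range(k, k + q):
                            t += A[a][c] * B[c][b]
                        C[a][b] = t
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_blocked(A, B, q=1))  # [[19, 22], [43, 50]]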
14. A Simple Parallel Algorithm
Example: adding n numbers in parallel (here n = 8); a sketch follows this list.
1. We start with 4 processors, and each of them adds 2 items in the first step.
2. The number of items is halved at every subsequent step; hence log n steps are required
for adding n numbers.
3. The processor requirement is O(n).
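A small Python sketch of this halving scheme (illustrative; it assumes n is a power of two for simplicity): in the round with stride 2^r, the processor at position i adds in the element at position i + 2^r, and after log₂ n rounds the total sits in position 0.

# Log-depth parallel sum: the number of active elements halves in each round.
def parallel_sum(values):
    x = list(values)
    n = len(x)                               # assumed to be a power of two here
    stride = 1
    while stride < n:
        for i in range(0, n, 2 * stride):    # these additions run concurrently
            x[i] += x[i + stride]
        stride *= 2
    return x[0]

print(parallel_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36; 8 numbers, 4 processors, 3 rounds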
CREW Cost
Let P(n) = O(n²)
n² processors read all cells Aij at once: O(1)
n² processors read all cells Bij at once: O(1)
Each processor multiplies Aij * Bij: O(1)
Parallel sum to get Cij: O(log n)
Store the value Cij: O(1)
T(n) = O(log n)
P(n) = O(n²)
W(n) = O(n² log n) = total number of operations
EREW Cost
Let P(n) = O(n²)
n² processors read all cells Aij at once: O(1)
n² processors cannot read all cells Bij in O(1):
concurrent reads are not allowed
Skew the memory accesses, replicate the data, or have processors read in turn:
O(log n)
Each processor multiplies Aij * Bij: O(1)
Parallel sum to get Cij: O(log n)
Store the value Cij: O(1)
T(n) = O(log n)
P(n) = O(n²)
W(n) = O(n² log n) = total number of operations
15. Advantages of the PRAM Model
PRAM removes algorithmic details concerning synchronization and communication,
allowing the algorithm designer to focus on the properties of the problem.
A PRAM algorithm includes an explicit understanding of the operation performed at each
time unit and an explicit allocation of processors to jobs at each time unit.
The PRAM design paradigm has turned out to be robust, and PRAM algorithms have been
mapped efficiently onto many other parallel models and even network models.
Disadvantages of the PRAM Model
Model inaccuracies:
unbounded local memory (registers)
all operations take unit time
processors run in lock step
Unaccounted costs:
non-local memory access
latency
bandwidth
memory access contention
Conclusion
PRAM algorithms are the source of many fundamental ideas in parallel computing.
They are a source of inspiration for practical parallel algorithms.
PRAM is simple and easy to understand.
The improved locality of block matrix multiplication can also improve the running time on a
uniprocessor or a distributed shared-memory multiprocessor with caches.
Reason: higher cache hit rates.