This document discusses multi-threaded algorithms and parallel computing. It begins with an introduction to multi-threaded algorithms and computational models such as shared memory. It then presents the symmetric multiprocessor (SMP) model and the spawn, sync, and parallel keywords. Examples of parallel algorithms for computing Fibonacci numbers, matrix multiplication, and merge sort are provided. Performance measures and the parallel laws (work, span, speedup, and parallelism) are also outlined.
2. Outline
1 Introduction
    Why Multi-Threaded Algorithms?
2 Model To Be Used
    Symmetric Multiprocessor
    Operations
    Example
3 Computation DAG
    Introduction
4 Performance Measures
    Introduction
    Running Time Classification
5 Parallel Laws
    Work and Span Laws
    Speedup and Parallelism
    Greedy Scheduler
    Scheduling Raises the Following Issue
6 Examples
    Parallel Fibonacci
    Matrix Multiplication
    Parallel Merge-Sort
7 Exercises
    Some Exercises you can try!!!
4. Multi-Threaded Algorithms
Motivation
Until now, our serial algorithms have been quite suitable for running on a
single-processor system.
However, multiprocessor machines are now ubiquitous:
Therefore, extending our serial models to a parallel computation model
is a must.
Computational Model
There exist many competing models of parallel computation that are
essentially different:
Shared Memory
Message Passing
Etc.
4 / 94
11. The Model to Be Used
Symmetric Multiprocessor
The model that we will use is the Symmetric Multiprocessor (SMP) where
a shared memory exists.
[Figure: four processors connected by a bus to the main shared memory; each processor holds four CPU cores with private L1i/L1d and L2 caches, a shared L3 cache, and point-to-point links between processors]
6 / 94
12. Dynamic Multi-Threading
Dynamic Multi-Threading
In reality, it can be difficult to handle multi-threaded programs on an
SMP.
Thus, we will assume a simple concurrency platform that handles all
the resources:
Scheduling
Memory
Etc.
It is called Dynamic Multi-Threading.
Dynamic Multi-Threading Computing Operations
Spawn
Sync
Parallel
7 / 94
23. SPAWN
SPAWN
When spawn precedes a procedure call, the parent procedure may continue to
execute in parallel with the spawned child.
Note
The keyword spawn does not force concurrent
execution; it only allows it to happen.
The scheduler decides which computations should run concurrently.
9 / 94
26. SYNC AND PARALLEL
SYNC
The keyword sync indicates that the procedure must wait for all its
spawned children to complete.
PARALLEL
This operation applies to loops, making it possible to execute the body
of the loop in parallel.
10 / 94
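To make these three operations concrete, here is a minimal C++17 sketch (an illustration added here, not part of the original pseudocode) that emulates spawn with std::async, sync with future::get(), and a parallel loop by splitting the iteration space over worker threads.

#include <algorithm>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int heavy(int i) { return i * i; }                          // stand-in for real work

int main() {
    // spawn / sync: the parent may keep working while the child runs.
    auto child = std::async(std::launch::async, heavy, 41); // "spawn heavy(41)"
    int here  = heavy(7);                                   // parent continues
    int there = child.get();                                // "sync"

    // parallel for i = 0..n-1: split the iterations over the available cores.
    const int n = 1000;
    std::vector<int> out(n);
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&, w] {
            for (int i = static_cast<int>(w); i < n; i += static_cast<int>(workers))
                out[i] = heavy(i);                          // disjoint indices: no race
        });
    for (auto& t : pool) t.join();

    std::cout << here + there + out[999] << '\n';
    return 0;
}

A real concurrency platform (Cilk, OpenMP, TBB) would do this scheduling for us; the point here is only how the three keywords map onto ordinary threads and futures.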
29. A Classic Parallel Piece of Code: Fibonacci Numbers
Fibonacci’s Definition
F0 = 0
F1 = 1
Fi = Fi−1 + Fi−2 for i > 1.
Naive Algorithm
Fibonacci(n)
1 if n ≤ 1 then
2 return n
3 else x = Fibonacci(n − 1)
4 y = Fibonacci(n − 2)
5 return x + y
12 / 94
31. Time Complexity
Recursion and Complexity
Recursion: T(n) = T(n − 1) + T(n − 2) + Θ(1).
Complexity: T(n) = Θ(Fn) = Θ(φ^n), where φ = (1 + √5)/2.
13 / 94
33. There is a Better Way
We can order the first three numbers in the sequence as
[F2 F1; F1 F0] = [1 1; 1 0]
Then
[F2 F1; F1 F0] · [F2 F1; F1 F0] = [1 1; 1 0] · [1 1; 1 0] = [2 1; 1 1] = [F3 F2; F2 F1]
14 / 94
35. There is a Better Way
Calculating in O(log n) when n is a power of 2
[1 1; 1 0]^n = [F(n + 1) F(n); F(n) F(n − 1)]
Thus
[1 1; 1 0]^(n/2) · [1 1; 1 0]^(n/2) = [F(n/2 + 1) F(n/2); F(n/2) F(n/2 − 1)] · [F(n/2 + 1) F(n/2); F(n/2) F(n/2 − 1)]
However...
We will use the naive version to illustrate the principles of parallel
programming.
15 / 94
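As a side note, this identity is easy to turn into code. The following C++ sketch (an illustration added here, not part of the slides) computes F(n) with O(log n) 2×2 matrix multiplications by repeated squaring, and it works for any n, not only powers of 2.

#include <array>
#include <cstdint>

using Mat = std::array<std::int64_t, 4>;               // row-major 2x2 matrix

Mat mul(const Mat& a, const Mat& b) {
    return {a[0]*b[0] + a[1]*b[2], a[0]*b[1] + a[1]*b[3],
            a[2]*b[0] + a[3]*b[2], a[2]*b[1] + a[3]*b[3]};
}

std::int64_t fib(unsigned n) {                          // F(0)=0, F(1)=1, ...
    Mat result = {1, 0, 0, 1};                          // identity matrix
    Mat base   = {1, 1, 1, 0};                          // [1 1; 1 0]
    while (n > 0) {                                     // square-and-multiply on the exponent
        if (n & 1) result = mul(result, base);
        base = mul(base, base);
        n >>= 1;
    }
    return result[1];                                   // entry (0,1) of [1 1; 1 0]^n is F(n)
}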
37. The Concurrent Code
Parallel Algorithm
PFibonacci(n)
1 if n ≤ 1 then
2 return n
3 else x = spawn PFibonacci(n − 1)
4 y = PFibonacci(n − 2)
5 sync
6 return x + y
16 / 94
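As a rough illustration (not part of the slides), the same structure can be written in C++17, mapping spawn to std::async and sync to future::get(). Spawning a whole OS thread per call is far more expensive than a real work-stealing scheduler such as Cilk's, so this only shows the shape of the code.

#include <future>

long pfib(int n) {
    if (n <= 1) return n;
    auto x = std::async(std::launch::async, pfib, n - 1);   // spawn PFibonacci(n - 1)
    long y = pfib(n - 2);                                    // the parent keeps working
    return x.get() + y;                                      // sync, then combine
}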
41. How do we compute a complexity? Computation DAG
Definition
A directed acyclic graph G = (V, E) where
The vertices V are sets of instructions.
The edges E represent dependencies between sets of instructions, i.e.
(u, v) means instruction set u must execute before v.
Notes
A set of instructions without any parallel control is grouped into a
strand.
Thus, V represents a set of strands and E represents dependencies
between strands induced by parallel control.
A strand of maximal length will be called a thread.
18 / 94
47. How do we compute a complexity? Computation DAG
Thus
If there is an edge between threads u and v, then they are said to be
(logically) in series.
If there is no edge, then they are said to be (logically) in parallel.
19 / 94
50. Edge Classification
Continuation Edge
A continuation edge (u, v) connects a thread u to its successor v within
the same procedure instance.
Spawned Edge
When a thread u spawns a new thread v, then (u, v) is called a spawned
edge.
Call Edges
Call edges represent normal procedure calls.
Return Edge
A return edge signals that a thread v has returned to its calling procedure.
21 / 94
56. Performance Measures
WORK
The work of a multi-threaded computation is the total time to execute the
entire computation on one processor.
Work = Σ_{i∈I} Time(Thread_i)
SPAN
The span is the longest time to execute the strands along any path of the
DAG.
In a DAG in which each strand takes unit time, the span equals the
number of vertices on a longest (critical) path in the DAG.
24 / 94
59. Example
Example
In Fibonacci(4), we have
17 threads.
8 vertices in the longest path
We have that
Assuming unit time
WORK=17 time units
SPAN=8 time units
Note
Running time depends not only on work and span but also on:
Available Cores
Scheduler Policies
26 / 94
69. Running Time Classification
Single Processor
T1: running time on a single processor.
Multiple Processors
TP: running time on P processors.
Unlimited Processors
T∞: running time on an unlimited number of processors, also called the span, if we
run each strand on its own processor.
28 / 94
73. Work Law
Definition
In one step, an ideal parallel computer with P processors can do:
At most P units of work.
Thus in TP time, it can perform at most P·TP work.
P·TP ≥ T1 =⇒ TP ≥ T1/P
30 / 94
77. Span Law
Definition
A P-processor ideal parallel computer cannot run faster than a
machine with an unlimited number of processors.
However, a computer with an unlimited number of processors can
emulate a P-processor machine by simply using P of its processors.
Therefore,
TP ≥ T∞
31 / 94
79. Work Calculations: Serial
Serial Computations: A followed by B
Note
Work: T1 (A ∪ B) = T1 (A) + T1 (B).
Span: T∞ (A ∪ B) = T∞ (A) + T∞ (B).
32 / 94
82. Work Calculations: Parallel
Parallel Computations: A in parallel with B
Note
Work: T1 (A ∪ B) = T1 (A) + T1 (B).
Span: T∞ (A ∪ B) = max {T∞ (A) , T∞ (B)}.
33 / 94
86. Speedup and Parallelism
Speedup
The speedup of a computation on P processors is defined as T1/TP.
Then, by the work law, T1/TP ≤ P. Thus, the speedup on P processors can
be at most P.
Notes
Linear Speedup when T1/TP = Θ(P).
Perfect Linear Speedup when T1/TP = P.
Parallelism
The parallelism of a computation is defined as T1/T∞.
In particular, we are looking to have a lot of parallelism.
This changes from algorithm to algorithm.
35 / 94
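For illustration, using the Fibonacci(4) example above with unit-time strands (T1 = 17, T∞ = 8): the parallelism is T1/T∞ = 17/8 ≈ 2.1, so no matter how many processors are available the speedup cannot exceed about 2.1, and by the work law with P = 2 processors the running time satisfies T2 ≥ 17/2 = 8.5 time units.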
94. Greedy Scheduler
Definition
A greedy scheduler assigns as many strands to processors as
possible in each time step.
Note
On P processors, if at least P strands are ready to execute during a
time step, then we say that the step is a complete step.
Otherwise we say that it is an incomplete step.
This changes from Algorithm to Algorithm.
37 / 94
98. Greedy Scheduler Theorem and Corollaries
Theorem 27.1
On an ideal parallel computer with P processors, a greedy scheduler
executes a multi-threaded computation with work T1 and span T∞ in time
TP ≤ T1/P + T∞.
Corollary 27.2
The running time TP of any multi-threaded computation scheduled by a
greedy scheduler on an ideal parallel computer with P processors is within
a factor of 2 of optimal.
Corollary 27.3
Let TP be the running time of a multi-threaded computation produced by
a greedy scheduler on an ideal parallel computer with P processors, and let
T1 and T∞ be the work and span of the computation, respectively. Then,
if P ≪ T1/T∞ (P much less than the parallelism), we have TP ≈ T1/P, or equivalently, a speedup of
approximately P.
38 / 94
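For instance, with the Fibonacci(4) numbers above (T1 = 17, T∞ = 8) and P = 2 processors, the theorem gives T2 ≤ 17/2 + 8 = 16.5 time units, while any schedule needs at least max(T1/P, T∞) = 8.5, which is the factor-of-2 statement of Corollary 27.2. With assumed illustrative values T1 = 10^6, T∞ = 10^3 and P = 10 (so P ≪ T1/T∞ = 1000), the bound gives TP ≤ 10^5 + 10^3 = 101,000 and hence a speedup of roughly 10^6 / 101,000 ≈ 9.9 ≈ P, which is Corollary 27.3.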
102. Race Conditions
Determinacy Race
A determinacy race occurs when two logically parallel instructions access
the same memory location and at least one of the instructions performs a
write.
Example
Race-Example()
1 x = 0
2 parallel for i = 1 to 3 do
3 x = x + 1
4 print x
40 / 94
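A minimal C++17 sketch of this race (an illustration; the slides use pseudocode): three threads update the same plain int without synchronization, so the program may print 1, 2 or 3 depending on the interleaving, and in C++ such a data race is formally undefined behavior.

#include <iostream>
#include <thread>
#include <vector>

int main() {
    int x = 0;                                        // shared memory location
    std::vector<std::thread> workers;
    for (int i = 0; i < 3; ++i)
        workers.emplace_back([&x] { x = x + 1; });    // three logically parallel writers
    for (auto& t : workers) t.join();
    std::cout << x << '\n';                           // may print 1, 2 or 3
    return 0;
}

Declaring x as std::atomic<int> (or guarding the update with a mutex) removes the race and guarantees the output 3.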
105. Example
NOTE
Although this is of great importance, it is beyond the scope of this class.
For more about this topic, see:
Maurice Herlihy and Nir Shavit, “The Art of Multiprocessor
Programming,” Morgan Kaufmann Publishers Inc., San Francisco, CA,
USA, 2008.
Andrew S. Tanenbaum, “Modern Operating Systems” (3rd ed.).
Prentice Hall Press, Upper Saddle River, NJ, USA, 2007.
42 / 94
114. Matrix Multiplication
Trick
To multiply two n × n matrices, we perform 8 matrix multiplications of
(n/2) × (n/2) matrices and one addition of n × n matrices.
Idea
A = [A11 A12; A21 A22], B = [B11 B12; B21 B22], C = [C11 C12; C21 C22]
C = [C11 C12; C21 C22] = [A11 A12; A21 A22] · [B11 B12; B21 B22]
  = [A11B11 A11B12; A21B11 A21B12] + [A12B21 A12B22; A22B21 A22B22]
46 / 94
116. Any Idea to Parallelize the Code?
What do you think?
Did you notice the multiplications of sub-matrices?
Then What?
We have for example A11B11 and A12B21!!!
We can do the following
A11B11 + A12B21
47 / 94
119. The use of the recursion!!!
As always our friend!!!
48 / 94
120. Pseudo-code of Matrix-Multiply
Matrix − Multiply(C, A, B, n) // The result of A × B in C with n a power of 2 for simplicity
1 if (n == 1)
2 C [1, 1] = A [1, 1] · B [1, 1]
3 else
4 allocate a temporary matrix T [1...n, 1...n]
5 partition A, B, C, T into (n/2) × (n/2) sub-matrices
6 spawn Matrix − Multiply (C11, A11, B11, n/2)
7 spawn Matrix − Multiply (C12, A11, B12, n/2)
8 spawn Matrix − Multiply (C21, A21, B11, n/2)
9 spawn Matrix − Multiply (C22, A21, B12, n/2)
10 spawn Matrix − Multiply (T11, A12, B21, n/2)
11 spawn Matrix − Multiply (T12, A12, B22, n/2)
12 spawn Matrix − Multiply (T21, A22, B21, n/2)
13 Matrix − Multiply (T22, A22, B22, n/2)
14 sync
15 Matrix − Add(C, T, n)
49 / 94
125. Explanation
Lines 1 - 2
Stops the recursion once you have only two numbers to multiply
Line 4
Extra matrix T for storing the second term in
[A11B11 A11B12; A21B11 A21B12] + [A12B21 A12B22; A22B21 A22B22] (the second matrix is T)
Line 5
Do the desired partition!!!
50 / 94
128. Explanation
Lines 6 to 13
Calculating the products in
[A11B11 A11B12; A21B11 A21B12] + [A12B21 A12B22; A22B21 A22B22]
Using Recursion and Parallel Computations
Line 14
A barrier to wait until all the parallel computations are done!!!
Line 15
Call Matrix − Add to add C and T.
51 / 94
131. Matrix ADD
Matrix Add Code
Matrix − Add(C, T, n)
// Add matrices C and T in-place to produce C = C + T
1 if (n == 1)
2 C [1, 1] = C [1, 1] + T [1, 1]
3 else
4 Partition C and T into (n/2) × (n/2) sub-matrices
5 spawn Matrix − Add (C11, T11, n/2)
6 spawn Matrix − Add (C12, T12, n/2)
7 spawn Matrix − Add (C21, T21, n/2)
8 Matrix − Add (C22, T22, n/2)
9 sync
52 / 94
135. Explanation
Lines 1 - 2
Stop the recursion once you have only two numbers to add
Line 4
To Partition
C = [A11B11 A11B12; A21B11 A21B12], T = [A12B21 A12B22; A22B21 A22B22]
In lines 5 to 8
We do the following sum in parallel!!!
[A11B11 A11B12; A21B11 A21B12] (C) + [A12B21 A12B22; A22B21 A22B22] (T)
53 / 94
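A compact C++17 sketch of both procedures (an illustration under simplifying assumptions, not the slides' exact pseudocode): matrices are stored row-major in a std::vector<double>, n is a power of 2, spawn is modeled with std::async and sync with future::get(). A real implementation would switch to a serial base case well before n = 1 and use a task scheduler instead of one OS thread per spawn.

#include <cstddef>
#include <future>
#include <vector>

struct View {                                    // sub-matrix view: top-left corner + stride
    std::vector<double>* data;
    std::size_t row, col, stride;
    double& at(std::size_t i, std::size_t j) const {
        return (*data)[(row + i) * stride + (col + j)];
    }
    View quad(int qi, int qj, std::size_t half) const {      // quadrant (qi, qj)
        return {data, row + qi * half, col + qj * half, stride};
    }
};

// Matrix-Add: C = C + T, recursing on quadrants in parallel.
void matrix_add(View C, View T, std::size_t n) {
    if (n == 1) { C.at(0, 0) += T.at(0, 0); return; }
    std::size_t h = n / 2;
    auto f1 = std::async(std::launch::async, matrix_add, C.quad(0, 0, h), T.quad(0, 0, h), h);
    auto f2 = std::async(std::launch::async, matrix_add, C.quad(0, 1, h), T.quad(0, 1, h), h);
    auto f3 = std::async(std::launch::async, matrix_add, C.quad(1, 0, h), T.quad(1, 0, h), h);
    matrix_add(C.quad(1, 1, h), T.quad(1, 1, h), h);
    f1.get(); f2.get(); f3.get();                             // sync
}

// Matrix-Multiply: C = A * B via 8 recursive sub-products (4 into C, 4 into T), then C = C + T.
void matrix_multiply(View C, View A, View B, std::size_t n) {
    if (n == 1) { C.at(0, 0) = A.at(0, 0) * B.at(0, 0); return; }
    std::vector<double> tmp(n * n, 0.0);                      // the temporary matrix T
    View T{&tmp, 0, 0, n};
    std::size_t h = n / 2;
    auto sp = [h](View Cq, View Aq, View Bq) {
        return std::async(std::launch::async, matrix_multiply, Cq, Aq, Bq, h);
    };
    auto f1 = sp(C.quad(0, 0, h), A.quad(0, 0, h), B.quad(0, 0, h));       // C11 = A11*B11
    auto f2 = sp(C.quad(0, 1, h), A.quad(0, 0, h), B.quad(0, 1, h));       // C12 = A11*B12
    auto f3 = sp(C.quad(1, 0, h), A.quad(1, 0, h), B.quad(0, 0, h));       // C21 = A21*B11
    auto f4 = sp(C.quad(1, 1, h), A.quad(1, 0, h), B.quad(0, 1, h));       // C22 = A21*B12
    auto f5 = sp(T.quad(0, 0, h), A.quad(0, 1, h), B.quad(1, 0, h));       // T11 = A12*B21
    auto f6 = sp(T.quad(0, 1, h), A.quad(0, 1, h), B.quad(1, 1, h));       // T12 = A12*B22
    auto f7 = sp(T.quad(1, 0, h), A.quad(1, 1, h), B.quad(1, 0, h));       // T21 = A22*B21
    matrix_multiply(T.quad(1, 1, h), A.quad(1, 1, h), B.quad(1, 1, h), h); // T22 = A22*B22
    f1.get(); f2.get(); f3.get(); f4.get(); f5.get(); f6.get(); f7.get();  // sync
    matrix_add(C, T, n);                                      // C = C + T
}

For example, with n = 4 one would build std::vector<double> a(16), b(16), c(16), wrap each as View{&a, 0, 0, 4} and so on, and call matrix_multiply on those views.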
139. Calculating Complexity of Matrix Multiplication
Work of Matrix Multiplication
The work T1(n) of matrix multiplication satisfies the recurrence:
T1(n) = 8·T1(n/2) (the sequential products) + Θ(n²) (the sequential sum) = Θ(n³).
54 / 94
140. Calculating Complexity of Matrix Multiplication
Span of Matrix Multiplication
T∞(n) = T∞(n/2) (the parallel products) + Θ(log n) (the parallel sum) = Θ(log² n)
This is because:
T∞(n/2): all the (n/2) × (n/2) sub-multiplications run at the same
time because of parallelism.
Θ(log n) is the span of the addition of the matrices (remember, we
are assuming unlimited processors), which has a critical path of length
log n.
55 / 94
144. How much Parallelism?
The Final Parallelism in this Algorithm is
T1(n) / T∞(n) = Θ(n³ / log² n)
Quite A Lot!!!
57 / 94
146. Merge-Sort : The Serial Version
We have
Merge − Sort (A, p, r)
Observation: Sort elements in A [p...r]
1 if (p < r) then
2 q = (p+r)/2
3 Merge − Sort (A, p, q)
4 Merge − Sort (A, q + 1, r)
5 Merge (A, p, q, r)
59 / 94
147. Merge-Sort : The Parallel Version
We have
Merge − Sort (A, p, r)
Observation: Sort elements in A [p...r]
1 if (p < r) then
2 q = (p+r)/2
3 spawn Merge − Sort (A, p, q)
4 Merge − Sort (A, q + 1, r) // Not necessary to spawn this
5 sync
6 Merge (A, p, q, r)
60 / 94
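A minimal C++17 sketch of this parallel Merge-Sort (an illustration, not the slides' code): spawn becomes std::async, sync becomes future::get(), and the Merge step is an ordinary sequential merge via std::inplace_merge, which is exactly the bottleneck discussed below.

#include <algorithm>
#include <functional>
#include <future>
#include <vector>

void parallel_merge_sort(std::vector<int>& A, std::size_t p, std::size_t r) {
    if (p >= r) return;                                    // 0 or 1 element: already sorted
    std::size_t q = p + (r - p) / 2;                       // q = floor((p + r) / 2)
    auto left = std::async(std::launch::async,             // spawn Merge-Sort(A, p, q)
                           parallel_merge_sort, std::ref(A), p, q);
    parallel_merge_sort(A, q + 1, r);                      // parent sorts the right half
    left.get();                                            // sync
    std::inplace_merge(A.begin() + p, A.begin() + q + 1, A.begin() + r + 1);  // serial merge
}

Called as parallel_merge_sort(v, 0, v.size() - 1) for a non-empty vector v; a practical version would fall back to std::sort below some cutoff instead of spawning a thread per level.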
148. Calculating Complexity of This simple Parallel Merge-Sort
Work of Merge-Sort
The work T1(n) of this Parallel Merge-Sort satisfies the recurrence:
T1(n) = Θ(1) if n = 1, and 2·T1(n/2) + Θ(n) otherwise
      = Θ(n log n)
because of Master Theorem Case 2.
Span
T∞(n) = Θ(1) if n = 1, and T∞(n/2) + Θ(n) otherwise
We have then
T∞(n/2): the two recursive sorts run at the same time because of parallelism.
Then, T∞(n) = Θ(n) because of Master Theorem Case 3.
61 / 94
152. How much Parallelism?
The Final Parallelism in this Algorithm is
T1(n) / T∞(n) = Θ(log n)
NOT A Lot!!!
62 / 94
153. Can we improve this?
We have a problem
We have a bottleneck!!! Where?
Yes in the Merge part!!!
We need to improve that bottleneck!!!
63 / 94
158. Then
So that if we insert x between T [q2 − 1] and T [q2], then
T [p1 · · · q2 − 1], x, T [q2 · · · r1] is sorted
67 / 94
159. Binary Search
It takes a key x and a sub-array T [p..r] and it does
1 If T [p..r] is empty (r < p), then it returns the index p.
2 if x ≤ T [p], then it returns p.
3 if x > T [p], then it returns the largest index q in the range
p < q ≤ r + 1 such that T [q − 1] < x.
68 / 94
162. Binary Search Code
BINARY-SEARCH(x, T, p, r)
1 low = p
2 high = max {p, r + 1}
3 while low < high
4 mid = ⌊(low + high)/2⌋
5 if x ≤ T [mid]
6 high = mid
7 else low = mid + 1
8 return high
69 / 94
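A small C++17 rendering of this routine (an illustration; the parameter convention is an assumption): the range T[p..r] is passed here as [p, r_plus_one), so an empty range is simply r_plus_one == p, and the function returns the first index q in that range with x ≤ T[q] (or r_plus_one if there is none), which matches the three cases above.

#include <algorithm>
#include <cstddef>
#include <vector>

std::size_t binary_search_split(int x, const std::vector<int>& T,
                                std::size_t p, std::size_t r_plus_one) {
    std::size_t low = p, high = std::max(p, r_plus_one);   // empty range: returns p
    while (low < high) {
        std::size_t mid = low + (high - low) / 2;          // mid = floor((low + high) / 2)
        if (x <= T[mid]) high = mid;                       // answer is at mid or before it
        else             low = mid + 1;                    // answer is after mid
    }
    return high;                                           // largest q with T[q - 1] < x
}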
168. Parallel Merge
Step 4. Recursively merge T [p1..q1 − 1] and T [p2..q2 − 1] and place
result into A [p3..q3 − 1]
71 / 94
169. Parallel Merge
Step 5. Recursively merge T [q1 + 1..r1] and T [q2..r2] and place
result into A [q3 + 1..r3]
72 / 94
170. The Final Code for Parallel Merge
Par − Merge (T, p1, r1, p2, r2, A, p3)
1 n1 = r1 − p1 + 1, n2 = r2 − p2 + 1
2 if n1 < n2
3 Exchange p1 ↔ p2, r1 ↔ r2, n1 ↔ n2
4 if (n1 == 0)
5 return
6 else
7 q1 = ⌊(p1 + r1)/2⌋
8 q2 = BinarySearch (T [q1] , T, p2, r2)
9 q3 = p3 + (q1 − p1) + (q2 − p2)
10 A [q3] = T [q1]
11 spawn Par − Merge (T, p1, q1 − 1, p2, q2 − 1, A, p3)
12 Par − Merge (T, q1 + 1, r1, q2, r2, A, q3 + 1)
13 sync
73 / 94
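A C++17 sketch of Par-Merge (an illustration under the same assumptions as the earlier sketches): T and A are std::vector<int>, indices are signed so that empty ranges (r < p) behave naturally, the binary search is inlined, and spawn/sync are again std::async/future::get().

#include <algorithm>
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Merge the sorted ranges T[p1..r1] and T[p2..r2] into A starting at position p3.
void par_merge(const std::vector<int>& T, long p1, long r1,
               long p2, long r2, std::vector<int>& A, long p3) {
    long n1 = r1 - p1 + 1, n2 = r2 - p2 + 1;
    if (n1 < n2) { std::swap(p1, p2); std::swap(r1, r2); std::swap(n1, n2); } // ensure n1 >= n2
    if (n1 == 0) return;                                   // both ranges empty
    long q1 = (p1 + r1) / 2;                               // midpoint of the larger range
    long low = p2, high = std::max(p2, r2 + 1);            // binary search for T[q1] in T[p2..r2]
    while (low < high) {
        long mid = (low + high) / 2;
        if (T[q1] <= T[mid]) high = mid; else low = mid + 1;
    }
    long q2 = high;
    long q3 = p3 + (q1 - p1) + (q2 - p2);
    A[q3] = T[q1];                                         // place the pivot element directly
    auto lower = std::async(std::launch::async, par_merge, std::cref(T),   // spawn: lower parts
                            p1, q1 - 1, p2, q2 - 1, std::ref(A), p3);
    par_merge(T, q1 + 1, r1, q2, r2, A, q3 + 1);           // parent merges the upper parts
    lower.get();                                           // sync
}

In the full algorithm, the sorted halves would first be copied into T and then merged back into A with this procedure.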
176. Explanation
Line 1
Obtain the length of the two arrays to be merged
Line 2: If one is larger than the other
We exchange the variables so that we work on the largest sub-array first!!! In this case we
make n1 ≥ n2
Line 4
If n1 == 0, return: there is nothing to merge!!!
74 / 94
179. Explanation
Line 10
It copies T [q1] directly into A [q3]
Line 11 and 12
They are used to recurse using nested parallelism to merge the sub-arrays
of elements less than and greater than x.
Line 13
The sync is used to ensure that the subproblems have completed before
the procedure returns.
75 / 94
182. First the Span Complexity of Parallel Merge: T∞ (n)
Suppositions
n = n1 + n2
What case should we study?
Remember: for computations done in parallel, the span is the maximum of the spans of the branches.
We notice then that
Because of the exchange in lines 2-3, n2 ≤ n1
76 / 94
185. Span Complexity of the Parallel Merge: T∞ (n)
Then
2n2 ≤ n1 + n2 = n ⟹ n2 ≤ n/2
Thus
In the worst case, one of the recursive calls in lines 11-12 merges:
⌊n1/2⌋ elements of T [p1...r1] (remember we split the larger sub-array at its mid-point),
With all n2 elements of T [p2...r2].
77 / 94
189. Span Complexity of the Parallel Merge: T∞ (n)
Thus, the number of elements involved in such a call is
n1/2 + n2 ≤ n1/2 + n2/2 + n2/2 ≤ n1/2 + n2/2 + (n/2)/2 = (n1 + n2)/2 + n/4 ≤ n/2 + n/4 = 3n/4
78 / 94
193. Span Complexity of the Parallel Merge: T∞ (n)
Knowing that the Binary Search takes
Θ (log n)
We get the span for parallel merge
T∞ (n) = T∞ (3n/4) + Θ (log n)
This can be solved using exercise 4.6-2 in Cormen's book
T∞ (n) = Θ (log² n)
79 / 94
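A rough way to see why this recurrence gives Θ (log² n), assuming we simply unwind it level by level (the formal solution is exercise 4.6-2):

% Each level shrinks the problem by a factor of 3/4, so there are Theta(log n)
% levels and level i adds Theta(log((3/4)^i n)) = O(log n):
\[
T_\infty(n) \;=\; \sum_{i=0}^{O(\log_{4/3} n)} \Theta\!\bigl(\log\bigl((3/4)^{i} n\bigr)\bigr)
\;\le\; O(\log n)\cdot O(\log n) \;=\; O(\log^{2} n).
\]
% For the lower bound, during the first (1/2) log_{4/3} n levels the subproblem
% still has at least sqrt(n) elements, so each of these Theta(log n) levels adds
% Omega(log sqrt(n)) = Omega(log n), giving Omega(log^2 n) and hence Theta(log^2 n).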
196. Calculating Work Complexity of Parallel Merge
Ok!!! We need to calculate the WORK
T1 (n) = Θ (Something)
Thus
We need to calculate the upper and lower bound.
80 / 94
198. Calculating Work Complexity of Parallel Merge
Work of Parallel Merge
The work T1 (n) of this Parallel Merge satisfies:
T1 (n) = Ω (n)
Because each of the n elements must be copied from array T to array A.
What about the Upper Bound O?
First notice that one of the recursive calls can receive as few as n/4 elements, when the other call handles the worst case of n1/2 + n2 ≤ 3n/4 elements.
And the work of the Binary Search is O (log n).
81 / 94
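Putting these two facts together (a rough sketch that ignores floors and the single element copied in line 10), the two recursive calls split the elements into pieces of size roughly αn and (1 − α) n with

\[
\frac{n}{4} \;\le\; \alpha n \;\le\; \frac{3n}{4},
\qquad\text{that is,}\qquad \alpha \in \left[\tfrac{1}{4}, \tfrac{3}{4}\right],
\]

which is exactly the recursion set up on the next slide.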
205. Calculating Work Complexity of Parallel Merge
Then
For some α ∈ [1/4, 3/4], we have the following recursion for the Parallel Merge when we have one processor:
T1 (n) = T1 (αn) + T1 ((1 − α) n) + Θ (log n)
where T1 (αn) + T1 ((1 − α) n) is the merge part and Θ (log n) is the binary search.
Remark: α varies at each level of the recursion!!!
82 / 94
208. Calculating Work Complexity of Parallel Merge
Then
Assume that T1 (n) ≤ c1n − c2 log n for positive constants c1 and c2.
Using c3 for the constant hidden in the Θ (log n) term, we then have
T1 (n) ≤ T1 (αn) + T1 ((1 − α) n) + c3 log n
≤ c1αn − c2 log (αn) + c1 (1 − α) n − c2 log ((1 − α) n) + c3 log n
= c1n − c2 log (α(1 − α)) − 2c2 log n + c3 log n (splitting the logarithms)
= c1n − c2 (log n + log (α(1 − α))) − (c2 − c3) log n
≤ c1n − (c2 − c3) log n, because log n + log (α(1 − α)) > 0 for n large enough
83 / 94
213. Calculating Work Complexity of Parallel Merge
Now, given 0 < α(1 − α) < 1
We have log (α(1 − α)) < 0
Thus, making n large enough
log n + log (α(1 − α)) > 0
Then
T1 (n) ≤ c1n − (c2 − c3) log n
84 / 94
216. Calculating Work Complexity of Parallel Merge
Now, we choose c2 large enough that
c2 − c3 ≥ 0
We have that
T1 (n) ≤ c1n = O(n)
85 / 94
218. Finally
Then
T1 (n) = Θ (n)
The parallelism of Parallel Merge
T1 (n) /T∞ (n) = Θ (n/ log² n)
86 / 94
220. Then What is the Complexity of Parallel Merge-sort with Parallel Merge?
First, the new code - Input A [p..r] - Output B [s..s + r − p]
Par-Merge-Sort (A, p, r, B, s)
1 n = r − p + 1
2 if (n == 1)
3 B [s] = A [p]
4 else let T [1..n] be a new array
5 q = ⌊(p + r)/2⌋
6 q′ = q − p + 1
7 spawn Par-Merge-Sort (A, p, q, T, 1)
8 Par-Merge-Sort (A, q + 1, r, T, q′ + 1)
9 sync
10 Par-Merge (T, 1, q′, q′ + 1, n, B, s)
87 / 94
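Continuing the illustrative C++17 sketch from the Par-Merge slide (the par_merge function is repeated here so the example is self-contained; the names, half-open index ranges and std::lower_bound binary search are assumptions of the sketch, not of the original pseudocode):

// Illustrative sketch of Par-Merge-Sort: spawn ~ std::async, sync ~ future::get().
#include <algorithm>
#include <future>
#include <iostream>
#include <vector>

// Merge the sorted runs T[p1..r1) and T[p2..r2) into A starting at index p3.
void par_merge(const std::vector<int>& T, int p1, int r1,
               int p2, int r2, std::vector<int>& A, int p3) {
    int n1 = r1 - p1, n2 = r2 - p2;
    if (n1 < n2) { std::swap(p1, p2); std::swap(r1, r2); std::swap(n1, n2); }
    if (n1 == 0) return;
    int q1 = (p1 + r1) / 2;
    int q2 = int(std::lower_bound(T.begin() + p2, T.begin() + r2, T[q1]) - T.begin());
    int q3 = p3 + (q1 - p1) + (q2 - p2);
    A[q3] = T[q1];
    auto left = std::async(std::launch::async, par_merge, std::cref(T),
                           p1, q1, p2, q2, std::ref(A), p3);
    par_merge(T, q1 + 1, r1, q2, r2, A, q3 + 1);
    left.get();
}

// Sort A[p..r) into B[s..s + (r - p)).
void par_merge_sort(const std::vector<int>& A, int p, int r,
                    std::vector<int>& B, int s) {
    int n = r - p;
    if (n == 1) { B[s] = A[p]; return; }      // lines 2-3: base case
    std::vector<int> T(n);                    // line 4: scratch array
    int q  = (p + r) / 2;                     // line 5: midpoint of A[p..r)
    int qp = q - p;                           // line 6: q', the size of the left half
    // lines 7-9: sort the two halves in parallel, then sync
    auto left = std::async(std::launch::async, par_merge_sort, std::cref(A),
                           p, q, std::ref(T), 0);
    par_merge_sort(A, q, r, T, qp);
    left.get();
    // line 10: merge the two sorted halves of T into B[s..)
    par_merge(T, 0, qp, qp, n, B, s);
}

int main() {
    std::vector<int> A{5, 2, 9, 1, 7, 3, 8, 6};
    std::vector<int> B(A.size());
    par_merge_sort(A, 0, int(A.size()), B, 0);
    for (int x : B) std::cout << x << ' ';    // prints: 1 2 3 5 6 7 8 9
    std::cout << '\n';
}

A real implementation would stop spawning below a grain size; here every call spawns a task, which is fine for an example but expensive in practice.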
226. Then, What is the amount of Parallelism of Parallel Merge-sort with Parallel Merge?
Work
Using the work T1^PM of the Parallel Merge inside the merge-sort recursion (PMS = Parallel Merge-Sort, PM = Parallel Merge), we get:
T1^PMS (n) = 2 T1^PMS (n/2) + T1^PM (n)
= 2 T1^PMS (n/2) + Θ (n)
= Θ (n log n) (Case 2 of the Master Theorem)
88 / 94
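A quick check of the Case 2 claim, assuming the standard statement of the Master Theorem with a = 2, b = 2 and f (n) = Θ (n):

\[
n^{\log_b a} = n^{\log_2 2} = n = \Theta(f(n))
\quad\Longrightarrow\quad
T_1^{PMS}(n) = \Theta\!\left(n^{\log_b a} \log n\right) = \Theta(n \log n).
\]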
230. Then, What is the amount of Parallelism of Parallel Merge-sort with Parallel Merge?
Span
Taking into account that lines 7 and 8 of Parallel Merge-Sort run in parallel, we get the following recursion for the span:
T∞^PMS (n) = T∞^PMS (n/2) + T∞^PM (n)
= T∞^PMS (n/2) + Θ (log² n)
= Θ (log³ n) (Exercise 4.6-2 in Cormen's book)
89 / 94
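A rough way to see where Θ (log³ n) comes from (again, formally exercise 4.6-2), assuming we simply unwind the recursion: level i works on n/2^i elements, so the Parallel Merge at that level contributes Θ (log² (n/2^i)):

\[
T_\infty^{PMS}(n) \;=\; \sum_{i=0}^{\lg n} \Theta\!\left(\lg^{2} \frac{n}{2^{i}}\right)
\;=\; \Theta\!\left(\sum_{j=0}^{\lg n} j^{2}\right)
\;=\; \Theta\!\left(\lg^{3} n\right).
\]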
234. Then, What is the amount of Parallelism of Parallel Merge-sort with Parallel Merge?
Parallelism
T1 (n) /T∞ (n) = Θ (n/ log² n)
90 / 94
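For a feeling of the magnitude, an illustrative calculation with an assumed input size n = 10^6 (so lg n ≈ 20):

\[
\frac{T_1(n)}{T_\infty(n)} \;=\; \Theta\!\left(\frac{n}{\lg^{2} n}\right)
\;\approx\; \frac{10^{6}}{20^{2}} \;=\; 2500,
\]

roughly a hundred times the Θ (lg n) ≈ 20 parallelism that merge sort achieves with an ordinary sequential merge.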
236. Plotting the T∞
We get an incredible difference when running the two versions of merge sort (with the simple merge and with the parallel merge) on an unbounded number of processors!!!
92 / 94
237. Outline
1 Introduction
Why Multi-Threaded Algorithms?
2 Model To Be Used
Symmetric Multiprocessor
Operations
Example
3 Computation DAG
Introduction
4 Performance Measures
Introduction
Running Time Classification
5 Parallel Laws
Work and Span Laws
Speedup and Parallelism
Greedy Scheduler
Scheduling Raises the Following Issue
6 Examples
Parallel Fibonacci
Matrix Multiplication
Parallel Merge-Sort
7 Exercises
Some Exercises you can try!!!
93 / 94