A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier.
RAM (random-access machine): a model for sequential algorithms. Each CPU step is a basic operation such as a logical operation, a memory access, or an arithmetic operation. The model's advantage: an algorithm's designer can ignore the details of the machine the algorithm is executed on.
MIMD (Multiple Instruction, Multiple Data)
1) Local
   a) A set of n processors, each with its own local memory
   b) Processors are connected to a common communication network
   c) A processor can access its own memory directly, but can also access another processor's memory after first requesting it
2) Modular
   a) Typically the modules (processors and memory) are arranged so that access to memory is uniform for all processors
   b) The access time depends on the communication network and the memory access pattern
3) PRAM
   a) A processor can access any word of memory in a single step
   b) It is just a model
General-purpose computing on graphics processing units (GPGPU) is a fairly recent trend in computer engineering research. GPUs are co-processors that have been heavily optimized for computer graphics processing. Computer graphics processing is a field dominated by data-parallel operations, particularly linear algebra matrix operations.
Each circle represents a node in the search tree which is also a call to the search procedure. A task is created for each node in the tree as it is explored. At any one time, some tasks are actively engaged in expanding the tree further (these are shaded in the figure); others have reached solution nodes and are terminating, or are waiting for their offspring to report back with solutions. The lines represent the channels used to return solutions.
We conclude this chapter by using performance models to compare four different parallel algorithms for the all-pairs shortest-path problem. This is an important problem in graph theory and has applications in communications, transportation, and electronics problems. It is interesting because analysis shows that three of the four algorithms can be optimal in different circumstances, depending on tradeoffs between computation and communication costs.
Transcript
1.
Parallel algorithms Parallel and Distributed Computing Wrocław, 07.05.2010 Paweł Duda
2.
Parallel algorithm – definition A parallel algorithm is an algorithm that has been specifically written for execution on a computer with two or more processors.
9.
Work-depth model
How can the cost of an algorithm be calculated?
Work (W): the total number of operations
Depth (D): the longest chain of dependencies
P = W/D: the parallelism of the algorithm
Picture: Summing 16 numbers on a tree. The total depth (longest chain of dependencies) is 4 and the total work (number of operations) is 15.
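The tree-summation picture can be checked with a short Python sketch. It is only an illustration: the function name `tree_sum` and the level-by-level pairing scheme are assumptions made here, not something taken from the slides. The sketch counts one unit of work per addition and one unit of depth per tree level.

```python
def tree_sum(values):
    """Sum a non-empty sequence pairwise, level by level.

    Returns (total, work, depth), where work is the number of additions
    and depth is the number of tree levels (longest dependency chain).
    """
    level = list(values)
    work = 0
    depth = 0
    while len(level) > 1:
        nxt = []
        # Additions within one level do not depend on each other,
        # so a parallel machine could perform them all in one step.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            work += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element is carried up unchanged
        level = nxt
        depth += 1
    return level[0], work, depth


total, work, depth = tree_sum(range(1, 17))  # the 16 numbers 1..16
print(total, work, depth)  # 136 15 4
print(work / depth)        # parallelism P = W/D = 3.75
```

For 16 inputs this reproduces the figures on the slide: W = 15, D = 4, so P = 3.75.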
12.
General-purpose computing on graphics processing units (GPGPU)
General-purpose computing on graphics processing units (GPGPU) - recent trend
GPUs co-processors
linear algebra matrix operations
Nvidia's Tesla GPGPU card
13.
Matrix multiplication
Algorithm: MATRIX_MULTIPLY(A, B)
  (l, m) := dimensions(A)
  (m, n) := dimensions(B)
  in parallel for i ∊ [0..l) do
    in parallel for j ∊ [0..n) do
      R_ij := sum({ A_ik * B_kj : k ∊ [0..m) })
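A minimal Python sketch of the same idea follows. Every output cell R_ij is computed as an independent task, which mirrors the two nested "in parallel for" loops. The use of `ThreadPoolExecutor`, the helper name `cell`, and the `max_workers` parameter are choices made for this illustration; in CPython, threads show the structure of the parallelism rather than giving a real speedup for pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def matrix_multiply(A, B, max_workers=4):
    """Multiply two matrices given as lists of rows."""
    l, m = len(A), len(A[0])
    m2, n = len(B), len(B[0])
    if m != m2:
        raise ValueError("inner dimensions must agree")

    def cell(ij):
        # One independent task per output element R[i][j].
        i, j = ij
        return sum(A[i][k] * B[k][j] for k in range(m))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results arrive in row-major order.
        flat = list(pool.map(cell, product(range(l), range(n))))

    # Regroup the flat result list into l rows of n entries each.
    return [flat[i * n:(i + 1) * n] for i in range(l)]


print(matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Because no task writes to data that another task reads, no synchronization is needed beyond collecting the results.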
The all-pairs shortest-path problem involves finding the shortest path between all pairs of vertices in a graph.
A graph G = (V, E) comprises a set V of N vertices {v_i} and a set E ⊆ V × V of edges.
In a directed graph, edges (v_i, v_j) and (v_j, v_i), i ≠ j, are distinct.
Picture: A simple directed graph, G , and its adjacency matrix, A .
17.
Floyd’s algorithm
Floyd’s algorithm is a graph analysis algorithm for finding shortest paths in a weighted graph. A single execution of the algorithm will find the shortest paths between all pairs of vertices.
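For reference, a sequential version of Floyd's algorithm can be sketched in Python as below. The adjacency-matrix convention (0 on the diagonal, infinity where there is no edge) and the small example graph are assumptions chosen for this illustration, and the sketch assumes there are no negative cycles.

```python
import math

INF = math.inf


def floyd(A):
    """Return the matrix of shortest path lengths for adjacency matrix A.

    A[i][j] is the weight of edge (i, j), 0 when i == j, INF if no edge.
    """
    n = len(A)
    S = [row[:] for row in A]  # work on a copy, leave A untouched
    for k in range(n):
        # After step k, S[i][j] is the shortest i -> j path that uses
        # only vertices 0..k as intermediate stops.
        for i in range(n):
            for j in range(n):
                if S[i][k] + S[k][j] < S[i][j]:
                    S[i][j] = S[i][k] + S[k][j]
    return S


# Hypothetical example graph: edges 0->1 (4), 0->2 (1), 2->1 (2), 1->3 (5).
A = [
    [0,   4,   1,   INF],
    [INF, 0,   INF, 5],
    [INF, 2,   0,   INF],
    [INF, INF, INF, 0],
]
S = floyd(A)
print(S[0])  # [0, 3, 1, 8]
```

The three nested loops give N^3 comparisons in total. The k loop must run in order, while the i and j loops within one step are independent, which is what the parallel versions exploit.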
The first parallel Floyd algorithm is based on a one-dimensional, row-wise domain decomposition of the intermediate matrix I and the output matrix S.
The algorithm can use at most N processors.
Each task has one or more adjacent rows of I and is responsible for performing computation on those rows.
19.
Parallel Floyd’s algorithm 1
Parallel version of Floyd's algorithm based on a one-dimensional decomposition of the I matrix. In (a), the data allocated to a single task are shaded: a contiguous block of rows. In (b), the data required by this task in the k-th step of the algorithm are shaded: its own block and the k-th row.
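The one-dimensional decomposition can be simulated in plain Python to show which data each task touches. This is a sketch under stated assumptions: the tasks are run one after another in a loop rather than on separate processors, the "broadcast" of the k-th row is modeled as a list copy, and the names `floyd_rowwise`, `num_tasks`, and `bounds` are invented for the illustration.

```python
import math

INF = math.inf


def floyd_rowwise(A, num_tasks):
    """Simulate Floyd's algorithm with a row-wise block decomposition."""
    n = len(A)
    if not 1 <= num_tasks <= n:
        raise ValueError("a 1-D decomposition can use at most N tasks")

    # Task t owns the contiguous rows bounds[t][0] .. bounds[t][1] - 1.
    bounds = [(t * n // num_tasks, (t + 1) * n // num_tasks)
              for t in range(num_tasks)]
    blocks = [[A[i][:] for i in range(lo, hi)] for lo, hi in bounds]

    for k in range(n):
        # The owner of row k sends a copy of it to every task.
        owner = next(t for t, (lo, hi) in enumerate(bounds) if lo <= k < hi)
        row_k = blocks[owner][k - bounds[owner][0]][:]

        # Each task updates only its own rows, using its block and row k.
        for t in range(num_tasks):
            for row in blocks[t]:
                ik = row[k]
                for j in range(n):
                    if ik + row_k[j] < row[j]:
                        row[j] = ik + row_k[j]

    # Gather the blocks back into one matrix.
    return [row for block in blocks for row in block]


# Hypothetical example graph: edges 0->1 (4), 0->2 (1), 2->1 (2), 1->3 (5).
A = [
    [0,   4,   1,   INF],
    [INF, 0,   INF, 5],
    [INF, 2,   0,   INF],
    [INF, INF, INF, 0],
]
print(floyd_rowwise(A, 2)[0])  # [0, 3, 1, 8]
```

In each step a task needs exactly what the caption describes: its own block of rows plus the k-th row, so the only communication is one row broadcast per step.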
An alternative parallel version of Floyd's algorithm uses a two-dimensional decomposition of the various matrices.
This version allows the use of up to N^2 processors.
21.
Parallel Floyd’s algorithm 2
Parallel version of Floyd's algorithm based on a two-dimensional decomposition of the I matrix. In (a), the data allocated to a single task are shaded: a contiguous submatrix. In (b), the data required by this task in the k-th step of the algorithm are shaded: its own block, and part of the k-th row and column.
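The two-dimensional decomposition can be simulated in the same spirit. In this sketch the tasks form a q × q grid, each owning one contiguous submatrix. The grid size parameter `q`, the name `floyd_2d`, and the sequential loop over tasks are assumptions for illustration only; a real implementation would run the tasks concurrently and exchange the row and column segments as messages.

```python
import math

INF = math.inf


def floyd_2d(A, q):
    """Simulate Floyd's algorithm on a q x q grid of submatrix tasks."""
    n = len(A)
    if not 1 <= q <= n:
        raise ValueError("grid side q must be between 1 and N")

    cuts = [t * n // q for t in range(q + 1)]  # block boundaries
    S = [row[:] for row in A]

    for k in range(n):
        # Snapshots stand in for the segments broadcast at step k.
        row_k = S[k][:]
        col_k = [S[i][k] for i in range(n)]

        for bi in range(q):
            for bj in range(q):
                r0, r1 = cuts[bi], cuts[bi + 1]
                c0, c1 = cuts[bj], cuts[bj + 1]
                # Task (bi, bj) needs only its part of column k and row k.
                col_part = col_k[r0:r1]
                row_part = row_k[c0:c1]
                for i in range(r0, r1):
                    for j in range(c0, c1):
                        cand = col_part[i - r0] + row_part[j - c0]
                        if cand < S[i][j]:
                            S[i][j] = cand
    return S


# Hypothetical example graph: edges 0->1 (4), 0->2 (1), 2->1 (2), 1->3 (5).
A = [
    [0,   4,   1,   INF],
    [INF, 0,   INF, 5],
    [INF, 2,   0,   INF],
    [INF, INF, INF, 0],
]
print(floyd_2d(A, 2)[0])  # [0, 3, 1, 8]
```

With q = N every task owns a single matrix element, which corresponds to the upper limit of N^2 processors for this decomposition.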