Algorithms
Parallel Algorithms
By: Sandeep Kumar Poonia, Asst. Professor, Jagannath University, Jaipur

What is Parallelism in Computers?
Parallelism is a digital computer performing more than one task at the same time.
Examples:
• I/O chips: Most computers contain special circuits for I/O devices, which allow some tasks to be performed in parallel.
• Pipelining of instructions: Some CPUs pipeline the execution of instructions.
• Multiple arithmetic units (AUs): Some CPUs contain multiple AUs, so they can perform more than one arithmetic operation at the same time.
• We are interested in parallelism involving more than one CPU.

Common Terms for Parallelism
• Concurrent processing: A program is divided into multiple processes which are run on a single processor. The processes are time-sliced on the single processor.
• Distributed processing: A program is divided into multiple processes which are run on multiple distinct machines. The machines are usually connected by a LAN and are typically workstations running multiple programs.
• Parallel processing: A program is divided into multiple processes which are run on multiple processors. The processors normally:
  - are in one machine
  - execute one program at a time
  - have high-speed communication between them

Parallel Programming
Issues in parallel programming that are not found in sequential programming:
• Task decomposition, allocation and sequencing:
  - Breaking down the problem into smaller tasks (processes) that can be run in parallel
  - Allocating the parallel tasks to different processors
  - Sequencing the tasks in the proper order
  - Using the processors efficiently
• Communication of interim results between processors: The goal is to reduce the cost of communication between processors. Task decomposition and allocation affect communication costs.
• Synchronization of processes: Some processes must wait at predetermined points for results from other processes.
• Different machine architectures

Performance Issues
• Scalability: Using more nodes should allow a job to run faster, or allow a larger job to run in the same time.
• Load balancing: All nodes should have the same amount of work. Avoid having nodes idle while others are computing.
• Bottlenecks:
  - Communication bottlenecks: too many messages are traveling on the same path
  - Serial bottlenecks
• Communication: Message passing is slower than computation.

Parallel Machines
Parameters used to describe or classify parallel computers:
• Type and number of processors
• Processor interconnections
• Global control
• Synchronous vs. asynchronous operation

Type and number of processors
• Massively parallel: Computer systems with thousands of processors. Examples: the parallel supercomputers CM-5 and Intel Paragon.
• Coarse-grained parallelism: A few (~10) processors in the system, usually high-powered.

Processor interconnections
Parallel computers may be loosely divided into two groups:
• Shared memory (or multiprocessors)
• Message passing (or multicomputers)

A simple parallel algorithm: adding n numbers in parallel
• Example for 8 numbers: We start with 4 processors, and each of them adds 2 items in the first step.
• The number of items is halved at every subsequent step. Hence log n steps are required for adding n numbers. The processor requirement is O(n).
We have omitted many details from our description of the algorithm:
• How do we allocate tasks to processors?
• Where is the input stored?
• How do the processors access the input as well as intermediate results?
We do not ask these questions while designing sequential algorithms.

How do we analyze a parallel algorithm?
A parallel algorithm is analyzed mainly in terms of its time, processor and work complexities.
• Time complexity T(n): How many time steps are needed?
• Processor complexity P(n): How many processors are used?
• Work complexity W(n): What is the total work done by all the processors? Here work is taken as the processor-time product, W(n) = P(n) × T(n).
For our example:
T(n) = O(log n)
P(n) = O(n)
W(n) = O(n log n)

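The halving scheme for adding n numbers can be sketched as a sequential simulation. This is only an illustration: the function name `parallel_sum` is hypothetical, and each pass of the `while` loop stands in for one synchronous time step in which all pairwise additions would happen at once.

```python
def parallel_sum(values):
    """Simulate the pairwise (tree) sum; return (total, number of parallel steps)."""
    items = list(values)
    steps = 0
    while len(items) > 1:
        # One parallel step: "processor" i adds items 2i and 2i+1.
        halved = [items[i] + items[i + 1] for i in range(0, len(items) - 1, 2)]
        if len(items) % 2 == 1:
            # An unpaired last item is carried over unchanged.
            halved.append(items[-1])
        items = halved
        steps += 1
    total = items[0] if items else 0
    return total, steps


total, steps = parallel_sum(range(1, 9))
print(total, steps)  # 36 3
```

For 8 inputs the simulation takes 3 steps, matching T(n) = O(log n), and the first step uses 4 simulated processors.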
How do we judge efficiency?
• Consider two parallel algorithms A1 and A2 for the same problem.
  A1: W1(n) work in T1(n) time.
  A2: W2(n) work in T2(n) time.
• We say A1 is more efficient than A2 if W1(n) = o(W2(n)), regardless of their time complexities. For example, W1(n) = O(n) and W2(n) = O(n log n).
• If W1(n) and W2(n) are asymptotically the same, then A1 is more efficient than A2 if T1(n) = o(T2(n)). For example, W1(n) = W2(n) = O(n), but T1(n) = O(log n) and T2(n) = O(log^2 n).
• It is difficult to give a more formal definition of efficiency. Consider the following situation:
  For A1, W1(n) = O(n log n) and T1(n) = O(n).
  For A2, W2(n) = O(n log^2 n) and T2(n) = O(log n).
• It is difficult to say which one is the better algorithm. Though A1 is more efficient in terms of work, A2 runs much faster.
• Both algorithms are interesting, and one may be better than the other depending on the specific parallel machine.

Optimal parallel algorithms
• Consider a problem, and let T(n) be the worst-case time upper bound on a serial algorithm for an input of length n.
• Assume also that T(n) is the lower bound for solving the problem. Hence, we cannot have a better upper bound.
• Consider a parallel algorithm for the same problem that does W(n) work in Tpar(n) time.
• The parallel algorithm is work-optimal if W(n) = O(T(n)).
• It is work-time-optimal if, in addition, Tpar(n) cannot be improved.

A work-optimal algorithm for adding n numbers
Step 1.
• Use only n / log n processors and assign log n numbers to each processor.
• Each processor adds its log n numbers sequentially in O(log n) time.
Step 2.
• We have only n / log n numbers left. We now execute our original algorithm on these n / log n numbers.
Now:
T(n) = O(log n)
W(n) = O((n / log n) × log n) = O(n)

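The two steps can be sketched as follows. This is a minimal simulation under stated assumptions: `work_optimal_sum` is a hypothetical name, the block size is taken as floor(log2 n), and the reported step count charges one step per number added in Step 1 plus one step per halving round in Step 2.

```python
import math


def work_optimal_sum(values):
    """Return (total, processors used, parallel steps) for the two-step scheme."""
    items = list(values)
    n = len(items)
    if n == 0:
        return 0, 0, 0
    # Assumption: each processor gets floor(log2 n) numbers (at least one).
    block = max(1, int(math.log2(n)))
    # Step 1: every processor sums its own block sequentially.
    partial = [sum(items[start:start + block]) for start in range(0, n, block)]
    processors = len(partial)
    steps = block
    # Step 2: run the pairwise halving algorithm on the partial sums.
    while len(partial) > 1:
        halved = [partial[i] + partial[i + 1] for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2 == 1:
            halved.append(partial[-1])
        partial = halved
        steps += 1
    return partial[0], processors, steps


print(work_optimal_sum(range(16)))  # (120, 4, 6)
```

With n = 16 the sketch uses 4 processors instead of 8, and the processor-time product stays proportional to n rather than n log n.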
Why is parallel computing important?
• We can justify the importance of parallel computing for two reasons:
  - very large application domains, and
  - physical limitations of VLSI circuits.
• Though computers are getting faster and faster, user demand for solving very large problems is growing at a still faster rate.
• Some examples include weather forecasting, simulation of protein folding, and computational physics.

Physical limitations of VLSI circuits
• The Pentium III processor uses 180 nanometer (nm) technology, i.e., a circuit element like a transistor can be etched within 180 × 10^{-9} m.
• The Pentium IV processor started on 180 nm technology and later moved to 130 nm.
• Intel has recently trialed processors made using 65 nm technology.

How many transistors can we pack?
• Pentium III has about 42 million transistors and Pentium IV about 55 million transistors.
• The number of transistors on a chip is approximately doubling every 18 months (Moore’s Law).
• There are now 100 transistors for every ant on Earth.

Physical limitations of VLSI circuits
• All semiconductor devices are Si based. It is fairly safe to assume that a circuit element will take at least a single Si atom.
• The covalent bonding in Si has a bond length of approximately 0.2 nm.
• Hence, we will reach the limit of miniaturization very soon.
• The upper bound on the speed of electronic signals is 3 × 10^8 m/sec, the speed of light.
• Hence, communication between two adjacent transistors will take approximately 10^{-18} sec.
• If we assume that a floating point operation involves switching of at least a few thousand transistors, such an operation will take about 10^{-15} sec in the limit.
• Hence, we are looking at 1000 teraflop machines at the peak of this technology (1 TFLOPS = 10^12 FLOPS; 1 flop = a floating point operation).
This is a very optimistic scenario.

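The arithmetic behind the 1000 teraflop estimate can be written out explicitly. This is a rough sketch under two assumptions that the slide does not state: the distance d between adjacent circuit elements is taken as about 3 × 10^{-10} m (the distance implied by the 10^{-18} sec figure), and "a few thousand" switchings is rounded to 10^3 and assumed to happen one after another.

```latex
\begin{align*}
t_{\text{signal}} &\approx \frac{d}{c}
  = \frac{3 \times 10^{-10}\,\text{m}}{3 \times 10^{8}\,\text{m/s}}
  = 10^{-18}\,\text{s} \\
t_{\text{flop}} &\approx 10^{3} \times t_{\text{signal}} = 10^{-15}\,\text{s} \\
\text{peak rate} &\approx \frac{1}{t_{\text{flop}}}
  = 10^{15}\,\text{FLOPS}
  = 1000 \times 10^{12}\,\text{FLOPS}
  = 1000\ \text{TFLOPS}
\end{align*}
```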
Other Problems
• The most difficult problem is to control power dissipation.
• 75 watts is considered a maximum power output of a processor.
• As we pack more transistors, the power output goes up and better cooling is necessary.
• Intel cooled its 8 GHz demo processor using liquid nitrogen!

The advantages of parallel computing
• Parallel computing offers the possibility of overcoming such physical limits by solving problems in parallel.
• In principle, thousands or even millions of processors can be used to solve a problem in parallel, and today’s fastest parallel computers have already reached teraflop speeds.
• Today’s microprocessors already use several parallel processing techniques, such as instruction-level parallelism and pipelined instruction fetching.
• Intel uses hyper-threading in the Pentium IV mainly because the processor is clocked at 3 GHz, but the memory bus operates only at about 400-800 MHz.

Problems in parallel computing
• The sequential or uni-processor computing model is based on von Neumann’s stored-program model.
• A program is written, compiled and stored in memory, and it is executed by bringing one instruction at a time to the CPU.
• Programs are written keeping this model in mind. Hence, there is a close match between the software and the hardware on which it runs.
• The theoretical RAM model captures these concepts nicely.
• There are many different models of parallel computing, and each model is programmed in a different way.
• Hence an algorithm designer has to keep a specific model in mind when designing an algorithm.
• Most parallel machines are suitable for solving specific types of problems.
• Designing operating systems is also a major issue.

The PRAM model
• n processors are connected to a shared memory.
• Each processor should be able to access any memory location in each clock cycle.
• Hence, there may be conflicts in memory access. Also, the memory management hardware needs to be very complex.
• We need some kind of hardware to connect the processors to individual locations in the shared memory.

Models of parallel computation
Parallel computational models can be broadly classified into two categories:
• Single Instruction Multiple Data (SIMD)
• Multiple Instruction Multiple Data (MIMD)
SIMD models are used for solving problems which have regular structures. We will mainly study SIMD models in this course.
MIMD models are more general and are used for solving problems which lack regular structures.

SIMD models
An N-processor SIMD computer has the following characteristics:
• Each processor can store both program and data in its local memory.
• Each processor stores an identical copy of the same program in its local memory.
• At each clock cycle, each processor executes the same instruction from this program. However, the data are different in different processors.
• The processors communicate among themselves either through an interconnection network or through a shared memory.

Design issues for network SIMD models
• A network SIMD model is a graph. The nodes of the graph are the processors and the edges are the links between the processors.
• Since each processor solves only a small part of the overall problem, it is necessary that processors communicate with each other while solving the overall problem.
• The main design issues for network SIMD models are communication diameter, bisection width, and scalability.
• We will discuss the two most popular network models, mesh and hypercube, in this lecture.

Communication diameter
• Communication diameter is the diameter of the graph that represents the network model. The diameter of a graph is the longest shortest-path distance between any pair of nodes.
• The data can be distributed in such a way that the two furthest nodes need to communicate. Communication between the two furthest nodes takes Ω(d) time steps.
• Hence, if the diameter for a model is d, the lower bound for any such computation on that model is Ω(d).

Bisection width
• The bisection width of a network model is the minimum number of links that must be removed to decompose the graph into two equal parts.
• If the bisection width is large, more information can be exchanged between the two halves of the graph, and hence problems can be solved faster.
[Figure: dividing the graph into two parts]

Scalability
• A network model must be scalable so that more processors can be easily added when new resources are available.
• The model should be regular so that each processor has a small number of links incident on it.
• If the number of links is large for each processor, it is difficult to add new processors, as too many new links have to be added.
• If we want to keep the diameter small, we need more links per processor. If we want our model to be scalable, we need fewer links per processor.

Diameter and Scalability
• The best model in terms of diameter is the complete graph. The diameter is 1. However, adding a new node to an n-processor machine requires n new links, one to every existing processor.
• The best model in terms of scalability is the linear array. We need to add only one link for a new processor. However, the diameter is n - 1, i.e. Θ(n), for a machine with n processors.

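The two extremes can be checked with a small breadth-first search. This is an illustrative sketch; the helper names `diameter`, `complete_graph` and `linear_array` are hypothetical, and graphs are represented as plain adjacency dictionaries.

```python
from collections import deque


def diameter(adj):
    """Longest shortest-path distance in a connected, undirected graph."""
    best = 0
    for source in adj:
        # Breadth-first search gives hop counts from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def complete_graph(n):
    """Every processor is linked to every other processor."""
    return {u: [v for v in range(n) if v != u] for u in range(n)}


def linear_array(n):
    """Processor u is linked only to u - 1 and u + 1."""
    return {u: [v for v in (u - 1, u + 1) if 0 <= v < n] for u in range(n)}


n = 8
print(diameter(complete_graph(n)), diameter(linear_array(n)))  # 1 7
```

For 8 processors the complete graph needs 7 links per node to reach diameter 1, while the linear array needs at most 2 links per node but has diameter 7.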
The mesh architecture
• Each internal processor of a 2-dimensional mesh is connected to 4 neighbors.
• When we combine two different meshes, only the processors on the boundary need extra links. Hence it is highly scalable.
• Both the diameter and the bisection width of an n-processor, 2-dimensional mesh are O(√n).
[Figure: a 4 x 4 mesh]

The hypercube architecture
[Figure: hypercubes of 0, 1, 2 and 3 dimensions]
• The diameter of a d-dimensional hypercube is d, as we need to flip at most d bits (traverse d links) to reach one processor from another.
• The bisection width of a d-dimensional hypercube is 2^{d-1}.
• The hypercube is a highly scalable architecture. Two d-dimensional hypercubes can be easily combined to form a (d+1)-dimensional hypercube.
• The hypercube has several variants, such as the butterfly, the shuffle-exchange network and cube-connected cycles.

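The bit-flipping view of the hypercube can be sketched in a few lines. The names `neighbors`, `hops` and `links_across_top_bit` are hypothetical, and nodes are assumed to be labeled with the integers 0 to 2^d - 1. The last function only counts the links across one natural cut (splitting on the highest address bit); it does not prove that this cut is minimal.

```python
def neighbors(node, d):
    """Nodes reachable over one link: flip exactly one of the d address bits."""
    return [node ^ (1 << bit) for bit in range(d)]


def hops(a, b):
    """Links to traverse between a and b: the number of differing bits."""
    return bin(a ^ b).count("1")


def links_across_top_bit(d):
    """Count links joining the half with top bit 0 to the half with top bit 1."""
    count = 0
    for u in range(2 ** d):
        for v in neighbors(u, d):
            if u < v and (u >> (d - 1)) != (v >> (d - 1)):
                count += 1
    return count


d = 3
print(neighbors(0b000, d))      # [1, 2, 4]
print(hops(0b000, 0b111))       # 3
print(links_across_top_bit(d))  # 4
```

For d = 3 the furthest pair is 3 hops apart and the cut contains 4 = 2^{d-1} links, in line with the figures on the slide.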
Adding n numbers on the mesh
Adding n numbers in √n steps.
[Figure: adding n numbers on the mesh]

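One way to reach O(√n) steps on a √n x √n mesh is sketched below. The routing is an assumption for illustration, since the original figure is not available here: running sums first move right to left along every row, then bottom to top along the first column. The name `mesh_sum` is hypothetical.

```python
def mesh_sum(grid):
    """Sum a square grid holding one value per processor; return (total, steps)."""
    side = len(grid)
    acc = [row[:] for row in grid]
    steps = 0
    # Phase 1: in each step, every row passes its running sum one link to the left.
    for col in range(side - 1, 0, -1):
        for row in range(side):  # all rows act in the same time step
            acc[row][col - 1] += acc[row][col]
        steps += 1
    # Phase 2: the first column passes running sums one link upward per step.
    for row in range(side - 1, 0, -1):
        acc[row - 1][0] += acc[row][0]
        steps += 1
    return acc[0][0], steps


grid = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
print(mesh_sum(grid))  # (136, 6)
```

With n = 16 processors the sum arrives at the corner processor after 2(√n - 1) = 6 steps, and each step only uses links between mesh neighbors.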
Complexity Analysis
Given n processors connected via a hypercube, S_Sum_Hypercube needs log n rounds to compute the sum. Since n messages are sent and received in each round, the total number of messages is O(n log n).
1. Time complexity: O(log n).
2. Message complexity: O(n log n).

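The round and message counts can be illustrated with a small simulation. This is not the S_Sum_Hypercube procedure itself, which is not reproduced here; it is a hypothetical sketch (`hypercube_sum`) in which every node exchanges its partial sum with its neighbor across one dimension per round, so that n messages are sent in each round.

```python
def hypercube_sum(values):
    """Return (per-node totals, rounds, messages) for an exchange-and-add sum."""
    n = len(values)
    if n < 1 or n & (n - 1):
        raise ValueError("the number of nodes must be a power of two")
    d = n.bit_length() - 1  # hypercube dimension, log2(n)
    local = list(values)
    messages = 0
    for bit in range(d):
        # Every node sends its partial sum to the neighbor across dimension `bit`.
        received = [local[node ^ (1 << bit)] for node in range(n)]
        messages += n
        local = [local[node] + received[node] for node in range(n)]
    return local, d, messages


totals, rounds, messages = hypercube_sum(list(range(1, 9)))
print(totals[0], rounds, messages)  # 36 3 24
```

For n = 8 the sketch finishes in log n = 3 rounds and sends n log n = 24 messages, and every node ends up holding the full sum.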
Classification of the PRAM model
• In the PRAM model, processors communicate by reading from and writing to the shared memory locations.
• The power of a PRAM depends on the kind of access to the shared memory locations.
In every clock cycle:
• In the Exclusive Read Exclusive Write (EREW) PRAM, each memory location can be accessed by only one processor.
• In the Concurrent Read Exclusive Write (CREW) PRAM, multiple processors can read from the same memory location, but only one processor can write.
• In the Concurrent Read Concurrent Write (CRCW) PRAM, multiple processors can read from or write to the same memory location.

• It is easy to allow concurrent reading. However, concurrent writing gives rise to conflicts.
• If multiple processors write to the same memory location simultaneously, it is not clear what is written to the memory location.
• In the Common CRCW PRAM, all the processors must write the same value.
• In the Arbitrary CRCW PRAM, one of the processors arbitrarily succeeds in writing.
• In the Priority CRCW PRAM, processors have priorities associated with them and the highest-priority processor succeeds in writing.

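The three write-conflict rules can be sketched as a function that decides what ends up in one memory cell after one clock cycle. The name `resolve_write` and the `(processor_id, value)` request format are hypothetical, and the sketch assumes that a lower processor id means a higher priority.

```python
import random


def resolve_write(requests, mode, rng=random):
    """Resolve concurrent writes to one cell; requests are (processor_id, value)."""
    if not requests:
        raise ValueError("no write requests")
    if mode == "common":
        # Common CRCW: all writers must agree on the value.
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("Common CRCW requires identical values")
        return values.pop()
    if mode == "arbitrary":
        # Arbitrary CRCW: any one of the writers succeeds.
        return rng.choice(requests)[1]
    if mode == "priority":
        # Priority CRCW: assumed here that the lowest id has the highest priority.
        return min(requests, key=lambda request: request[0])[1]
    raise ValueError("unknown mode: " + mode)


writes = [(3, 7), (1, 9), (2, 5)]
print(resolve_write(writes, "priority"))            # 9
print(resolve_write([(0, 4), (1, 4)], "common"))    # 4
```

An algorithm that is correct under the Arbitrary rule must work whichever request `rng.choice` happens to pick.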
• The EREW PRAM is the weakest and the Priority CRCW PRAM is the strongest PRAM model.
• The relative powers of the different PRAM models are as follows: EREW ≤ CREW ≤ Common CRCW ≤ Arbitrary CRCW ≤ Priority CRCW.
• An algorithm designed for a weaker model can be executed within the same time and work complexities on a stronger model.
• We say model A is less powerful than model B if either:
  - the time complexity for solving a problem is asymptotically less in model B than in model A, or
  - the time complexities are the same, but the processor or work complexity is asymptotically less in model B than in model A.
• An algorithm designed for a stronger PRAM model can be simulated on a weaker model either with asymptotically more processors (work) or with asymptotically more time.

Adding n numbers on a PRAM
[Figure: adding n numbers on a PRAM]
• This algorithm works on the EREW PRAM model as there are no read or write conflicts.
• We will use this algorithm to design a matrix multiplication algorithm on the EREW PRAM.

Matrix multiplication
For simplicity, we assume that n = 2^p for some integer p.
• Each c_{i,j}, 1 ≤ i, j ≤ n, can be computed in parallel.
• We allocate n processors for computing c_{i,j}. Suppose these processors are P_1, P_2, …, P_n.
• In the first time step, processor P_m, 1 ≤ m ≤ n, computes the product a_{i,m} × b_{m,j}.
• We now have n numbers, and we use the addition algorithm to sum these n numbers in log n time.
• Computing each c_{i,j}, 1 ≤ i, j ≤ n, takes n processors and log n time.
• Since there are n^2 such c_{i,j}'s, we need overall O(n^3) processors and O(log n) time.
• The processor requirement can be reduced to O(n^3 / log n). Exercise!
• Hence, the work complexity is O(n^3).

• However, this algorithm requires concurrent read capability.
• Note that each element a_{i,j} (and b_{i,j}) participates in computing n elements of the C matrix.
• Hence n different processors will try to read each a_{i,j} (and b_{i,j}) in our algorithm.

• Hence our algorithm runs on the CREW PRAM, and we need to avoid the read conflicts to make it run on the EREW PRAM.
• We will create n copies of each of the elements a_{i,j} (and b_{i,j}). Then one copy can be used for computing each c_{i,j}.

Creating n copies of a number in O(log n) time using O(n) processors on the EREW PRAM:
• In the first step, one processor reads the number and creates a copy. Hence, there are two copies now.
• In the second step, two processors read these two copies and create four copies.
• Since the number of copies doubles in every step, n copies are created in O(log n) steps.
• Though we need n processors, the processor requirement can be reduced to O(n / log n).

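The doubling scheme for creating n copies can be sketched as follows. The function name `replicate` is hypothetical; each pass of the loop stands for one parallel step in which every participating processor reads a different existing copy and writes a different new cell, so no location is read or written by two processors at once.

```python
def replicate(value, n):
    """Create n copies of `value` by doubling; return (copies, parallel steps)."""
    copies = [value]
    steps = 0
    while len(copies) < n:
        # Each of the first `new` copies is read by exactly one processor,
        # which writes it into its own fresh cell.
        new = min(len(copies), n - len(copies))
        copies.extend(copies[:new])
        steps += 1
    return copies, steps


copies, steps = replicate(42, 8)
print(len(copies), steps)  # 8 3
```

Eight copies exist after 3 steps, matching the O(log n) bound.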
• Since there are n^2 elements in the matrix A (and in B), we need O(n^3 / log n) processors and O(log n) time to create n copies of each element.
• After this, there are no read conflicts in our algorithm. The overall matrix multiplication algorithm now takes O(log n) time and O(n^3 / log n) processors on the EREW PRAM.
• The memory requirement is of course much higher for the EREW PRAM.

Using n^3 Processors
Algorithm MatMult_CREW
/* Step 1 */
forall P_{i,j,k}, where 1 ≤ i, j, k ≤ n, do in parallel
    C[i,j,k] ← A[i,k] * B[k,j]
endfor
/* Step 2 */
for l = 1 to log n do
    forall P_{i,j,k}, where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/2, do in parallel
        if (2k modulo 2^l) = 0 then
            C[i,j,2k] ← C[i,j,2k] + C[i,j,2k - 2^{l-1}]
        endif
    endfor
endfor
/* The output matrix is stored in locations C[i,j,n], where 1 ≤ i, j ≤ n */

Complexity Analysis
• In the first step, the products are computed in parallel in constant time, that is, O(1).
• These products are summed in O(log n) time during the second step. Therefore, the run time is O(log n).
• Since the number of processors used is n^3, the cost is O(n^3 log n).
1. Run time, T(n) = O(log n).
2. Number of processors, P(n) = n^3.
3. Cost, C(n) = O(n^3 log n).

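The two steps of MatMult_CREW can be traced with a sequential simulation. This is a sketch under stated assumptions: the function name `matmult_crew` is hypothetical, the matrices are square Python lists whose size n is a power of two, and the third index of C is kept 1-based (slot 0 is unused) so that the index arithmetic of the pseudocode carries over unchanged.

```python
def matmult_crew(A, B):
    """Simulate MatMult_CREW for n x n matrices, with n a power of two."""
    n = len(A)
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    rounds = n.bit_length() - 1  # log2(n)
    # Step 1: C[i][j][k] holds A[i][k] * B[k][j], stored at 1-based slot k.
    C = [[[0] + [A[i][k] * B[k][j] for k in range(n)] for j in range(n)]
         for i in range(n)]
    # Step 2: log n rounds of pairwise additions along the third index.
    for l in range(1, rounds + 1):
        stride = 1 << l
        for i in range(n):
            for j in range(n):
                for k in range(1, n // 2 + 1):
                    if (2 * k) % stride == 0:
                        C[i][j][2 * k] += C[i][j][2 * k - stride // 2]
    # The result for entry (i, j) ends up in slot n.
    return [[C[i][j][n] for j in range(n)] for i in range(n)]


print(matmult_crew([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Within one round the slots that are written (multiples of 2^l) are never the slots that are read (odd multiples of 2^{l-1}), which is why running the inner loops sequentially gives the same result as the parallel step.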
Reducing the Number of Processors
In the above algorithm, all the processors were busy during the first step, but not all of them performed addition operations during the second step. The second step consists of log n iterations. During the first iteration, only n^3/2 processors performed addition operations, only n^3/4 performed addition operations in the second iteration, and so on. With this understanding, we may be able to use a smaller machine with only n^3/log n processors.
1. Each processor P_{i,j,k}, where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/log n, computes the sum of log n products. This step will produce n^3/log n partial sums.
2. The sums of products produced in step 1 are added to produce the resulting matrix, as discussed before.

Complexity Analysis
1. Run time, T(n) = O(log n).
2. Number of processors, P(n) = n^3/log n.
3. Cost, C(n) = O(n^3).

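The bounds for the reduced-processor version follow from the two steps. As a worked sketch, assuming each processor performs its log n multiply-and-add operations sequentially in step 1 and the remaining n / log n partial sums per output entry are combined by pairwise halving in step 2:

```latex
\begin{align*}
T_{\text{step 1}}(n) &= O(\log n) \\
T_{\text{step 2}}(n) &= O\!\left(\log \frac{n}{\log n}\right) = O(\log n) \\
T(n) &= T_{\text{step 1}}(n) + T_{\text{step 2}}(n) = O(\log n) \\
C(n) &= P(n) \times T(n) = \frac{n^{3}}{\log n} \times O(\log n) = O(n^{3})
\end{align*}
```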