Lecture 3


    Lecture 3 Presentation Transcript

    • Parallel Computing, Lecture # 3
    • Course Material
      Text Books:
      - Computer Architecture & Parallel Processing, Kai Hwang, Faye A. Briggs
      - Advanced Computer Architecture, Kai Hwang
      Reference Book:
      - Scalable Computer Architecture
    • What is Parallel Processing?
      It is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process.
      Efficiency is measured as: Efficiency = Time / Speed + Accuracy
      * Always classify definitions first, then give properties.
    • Types of Concurrent Events
      There are 3 types of concurrent events:
      1. Parallel Event or Synchronous Event (type of concurrency is parallelism): it may occur in multiple resources during the same time interval.
         Example: array/vector processors (diagram: one CU driving several PEs; based on ALU).
    • 2. Simultaneous Event or Asynchronous Event (type of concurrency is simultaneity): it may occur in multiple resources at the same time instant.
         Example: multiprocessing system.
      3. Pipelined Event or Overlapped Event: it may occur in overlapped time spans.
         Example: pipelined processor.
    • System Attributes versus Performance Factors
      The ideal performance of a computer system requires a perfect match between machine capability and program behavior.
      Machine capability can be enhanced with better hardware technology; however, program behavior is difficult to predict due to its dependence on application and run-time conditions.
      Below are the five fundamental factors for projecting the performance of a computer.
    • 1. Clock Rate: the CPU is driven by a clock of constant cycle time $\tau$, with $\tau = 1/f$ (usually expressed in ns).
      2. CPI (cycles per instruction): as different instructions require different numbers of cycles to execute, CPI is taken as an average value for a given instruction set and a given program mix.
    • 3. Execution Time: let $I_c$ be the instruction count, i.e. the total number of instructions in the program. Then the execution time is
         $T = I_c \times \text{CPI} \times \tau$
      Now, CPI = instruction cycle = processor cycles + memory cycles, so
         instruction cycle $= p + m \times k$
      where m = number of memory references,
    • p = number of processor cycles,
      k = latency factor (how slow the memory is relative to the CPU).
      Now let C be the total number of cycles required to execute a program:
         $C = I_c \times \text{CPI}$
      The time to execute the program is then
         $T = C \times \tau$
    • 4. MIPS Rate:
         $\text{MIPS rate} = \dfrac{I_c}{T \times 10^{6}}$
      5. Throughput Rate: the number of programs executed per unit time.
         $W = \dfrac{1}{T}$ or $W = \dfrac{\text{MIPS} \times 10^{6}}{I_c}$
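The two throughput forms agree, as a short substitution shows. This is a sketch using only the symbols defined on these slides, with $f = 1/\tau$:

```latex
\begin{align*}
\text{MIPS} &= \frac{I_c}{T \times 10^{6}}
             = \frac{I_c}{I_c \times \text{CPI} \times \tau \times 10^{6}}
             = \frac{f}{\text{CPI} \times 10^{6}}, \\
W &= \frac{1}{T}
   = \frac{f}{I_c \times \text{CPI}}
   = \frac{\text{MIPS} \times 10^{6}}{I_c}.
\end{align*}
```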
    • Numerical: A benchmark program is executed on a 40 MHz processor. The benchmark program has the following statistics.
      Instruction Type    Instruction Count    Clock Cycle Count
      Arithmetic          45000                1
      Branch              32000                2
      Load/Store          15000                2
      Floating Point      8000                 2
      Calculate the average CPI, MIPS rate and execution time for the above benchmark program.
    • Average CPI $= C / I_c$, where C = total number of cycles to execute the whole program and $I_c$ = total instruction count.
         $\text{CPI} = \dfrac{45000 \times 1 + 32000 \times 2 + 15000 \times 2 + 8000 \times 2}{45000 + 32000 + 15000 + 8000} = \dfrac{155000}{100000} = 1.55$
      Execution time: $T = C / f$
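    • $T = \dfrac{155000}{40 \times 10^{6}}\ \text{s} = \dfrac{0.155}{40}\ \text{s} = 3.875\ \text{ms}$
      $\text{MIPS rate} = \dfrac{I_c}{T \times 10^{6}} = \dfrac{100000}{3.875 \times 10^{-3} \times 10^{6}} \approx 25.8$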
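The numerical above can be checked with a few lines of Python. This is a minimal sketch: the dictionary layout and the function name `performance` are illustrative choices, not part of the lecture.

```python
# Benchmark statistics from the slide: (instruction count, cycles per instruction).
benchmark = {
    "arithmetic": (45_000, 1),
    "branch": (32_000, 2),
    "load_store": (15_000, 2),
    "floating_point": (8_000, 2),
}
clock_hz = 40e6  # 40 MHz processor


def performance(mix, clock_hz):
    """Return (average CPI, execution time in seconds, MIPS rate)."""
    instruction_count = sum(count for count, _ in mix.values())
    total_cycles = sum(count * cycles for count, cycles in mix.values())
    cpi = total_cycles / instruction_count         # CPI = C / Ic
    exec_time = total_cycles / clock_hz            # T = C / f
    mips = instruction_count / (exec_time * 1e6)   # MIPS = Ic / (T * 10^6)
    return cpi, exec_time, mips


cpi, exec_time, mips = performance(benchmark, clock_hz)
print(f"CPI = {cpi:.2f}, T = {exec_time * 1e3:.3f} ms, MIPS = {mips:.1f}")
```

The printed values should match the hand calculation: CPI 1.55, T 3.875 ms, MIPS rate about 25.8.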
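    • System Attributes versus Performance Factors (table: rows are system attributes, columns are performance factors)
      Performance factors: $I_c$; CPI, split into p, m and k; $\tau$
      System attributes: instruction-set architecture; compiler technology; CPU implementation & technology; memory hierarchy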
    • Practice Problems:
      1. Do problem number 1.4 from the book Advanced Computer Architecture by Kai Hwang.
      2. A benchmark program containing 234,000 instructions is executed on a processor with a cycle time of 0.15 ns. The statistics of the program are given below. Each memory reference requires 3 CPU cycles to complete. Calculate the MIPS rate and throughput for the program.
    • Instruction Type    Instruction Mix    Processor Cycles    Memory Cycles
      Arithmetic          58 %               2                   2
      Branch              33 %               3                   1
      Load/Store          9 %                3                   2
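One way to set up practice problem 2 is sketched below. It rests on an assumption the slides do not state outright: the "Memory Cycles" column is read as the number of memory references m per instruction, and "3 CPU cycles per memory reference" as the latency factor k, so that each instruction type costs p + m × k cycles. The variable and function names are illustrative.

```python
# Instruction mix from the slide:
# (fraction of instructions, processor cycles p, memory references m).
# Assumption: the "Memory Cycles" column is the number of memory references m.
mix = {
    "arithmetic": (0.58, 2, 2),
    "branch": (0.33, 3, 1),
    "load_store": (0.09, 3, 2),
}
instruction_count = 234_000
cycle_time_s = 0.15e-9   # tau = 0.15 ns
latency_factor = 3       # k: CPU cycles per memory reference (assumed reading)


def average_cpi(mix, k):
    """Weighted average of p + m * k over the instruction mix."""
    return sum(frac * (p + m * k) for frac, p, m in mix.values())


cpi = average_cpi(mix, latency_factor)
exec_time = instruction_count * cpi * cycle_time_s   # T = Ic * CPI * tau
mips = instruction_count / (exec_time * 1e6)         # MIPS = Ic / (T * 10^6)
throughput = 1 / exec_time                           # W = 1 / T, programs per second
print(f"CPI = {cpi:.2f}, MIPS = {mips:.1f}, W = {throughput:.1f} programs/s")
```

If the column is instead meant as total memory cycles per instruction, drop the multiplication by k in `average_cpi`.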
    • Programmatic Levels of Parallel Processing
      Parallel processing can be exploited at 4 programmatic levels:
      1. Job / Program Level
      2. Task / Procedure Level
      3. Interinstruction Level
      4. Intrainstruction Level
    • 1. Job / Program Level: it requires the development of parallel processable algorithms. The implementation of parallel algorithms depends on the efficient allocation of limited hardware and software resources to multiple programs being used to solve a large computational problem.
      Examples: weather forecasting, medical consulting, oil exploration, etc.
    • 2. Task / Procedure Level: it is conducted among procedures/tasks within the same program. This involves the decomposition of the program into multiple tasks (for simultaneous execution).
      3. Interinstruction Level: the aim is to exploit concurrency among multiple instructions so that they can be executed simultaneously. Data dependency analysis is often performed to reveal parallelism among instructions. Vectorization may be desired for scalar operations within DO loops.
    • 4. Intrainstruction Level: this level exploits faster and concurrent operations within each instruction, e.g. the use of carry-lookahead and carry-save adders instead of ripple-carry adders.
    • Key Points:
      1. The hardware role increases from high to low levels, whereas the software role increases from low to high levels.
      2. The highest (job) level is conducted algorithmically, while the lowest level is implemented directly by hardware means.
      3. The trade-off between hardware and software approaches to solving a problem is always a very controversial issue.
    • 4. As hardware cost declines and software cost increases, more and more hardware methods are replacing the conventional software approaches.
      Conclusion: Parallel processing is a combined field of studies which requires broad knowledge of and experience with all aspects of algorithms, languages, hardware, software, performance evaluation and computing alternatives.
    • Parallel Processing in Uniprocessor Systems
      A number of parallel processing mechanisms have been developed in uniprocessor computers. We identify them in six categories, which are described below.
      1. Multiplicity of Functional Units: different ALU functions can be distributed to multiple specialized functional units which can operate in parallel.
    • The CDC-6600 has 10 functional units built into its CPU.
      IBM 360/91 (diagram): a fixed-point unit and a floating-point unit, the latter divided into add/sub and mul/div units.
    • 2. Parallelism & Pipelining within the CPU: use of carry-lookahead and carry-save adders instead of ripple-carry adders.
      (Figure: two 4-bit parallel adders cascaded to create an 8-bit parallel adder.)
    • Ripple-carry Adder:
      At each stage the sum bit is not valid until after the carry bits in all the preceding stages are valid.
      The time required for a valid addition is directly proportional to the number of bits.
      Problem: The time required to generate each carry-out bit in the 8-bit parallel adder is 24 ns. Once all inputs to an adder are valid, there is a delay of 32 ns until the output sum bit is valid. What is the maximum number of additions per second that the adder can perform?
    • 1 addition = 7 × 24 + 32 = 200 ns
      Additions / sec $= 1 / (200 \times 10^{-9}) = 5 \times 10^{-3} \times 10^{9} = 5 \times 10^{6}$ = 5 million additions / sec
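The ripple-carry timing argument can be written as a small helper. This is a sketch: the function names are illustrative, and it assumes, as the worked problem does, that the carry must ripple through all but the last stage before the final sum bit settles.

```python
def ripple_carry_add_time_ns(bits, carry_delay_ns, sum_delay_ns):
    """Worst-case time for one addition: the carry ripples through
    (bits - 1) stages, then the last stage produces its sum bit."""
    return (bits - 1) * carry_delay_ns + sum_delay_ns


def additions_per_second(add_time_ns):
    """Maximum addition rate for a given worst-case addition time."""
    return 1e9 / add_time_ns


add_time = ripple_carry_add_time_ns(bits=8, carry_delay_ns=24, sum_delay_ns=32)
rate = additions_per_second(add_time)
print(f"{add_time} ns per addition, {rate / 1e6:.1f} million additions per second")
```

Doubling the word length roughly doubles the addition time, which is the proportionality the slide refers to.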
    • Practice Problem
      Assume the 32 ns delay in producing a valid sum bit in the 8-bit parallel adder. What maximum delay in generating a carry-out bit is allowed if the adder must be capable of performing $10^{7}$ additions per second?
    • Carry-Lookahead Adder: a 4-bit parallel adder incorporating carry look-ahead. Each full adder is of the type shown in the figure.
    • Essence & Idea: determine and generate the carry input bits for all stages after examining the input bits simultaneously.
         $C_1 = A_0 B_0 + A_0 C_0 + B_0 C_0 = A_0 B_0 + (A_0 + B_0) C_0$
         $C_2 = A_1 B_1 + (A_1 + B_1) C_1$
         $C_n = A_{n-1} B_{n-1} + (A_{n-1} + B_{n-1}) C_{n-1}$
      In each equation the product term is the carry generate term and the bracketed sum is the carry propagate term.
    • If $A_i$ and $B_i$ are both 1 then $C_{i+1} = 1$. The input data itself generates a carry; this is called carry generate.
         $G_0 = A_0 B_0$, $G_1 = A_1 B_1$, ..., $G_{n-1} = A_{n-1} B_{n-1}$
      $C_{i+1}$ can also be 1 if $C_i = 1$ and either $A_i$ or $B_i$ is 1. Here $A_i$ or $B_i$ is used to propagate the carry. This is called carry propagate, represented by $P_i$.
    • $P_0 = A_0 + B_0$, $P_1 = A_1 + B_1$, ..., $P_{n-1} = A_{n-1} + B_{n-1}$
      Now writing the carry equations in terms of carry generate and carry propagate:
         $C_1 = G_0 + P_0 C_0$
         $C_2 = G_1 + P_1 C_1 = G_1 + P_1 (G_0 + P_0 C_0) = G_1 + P_1 G_0 + P_1 P_0 C_0$
    • $C_3 = G_2 + P_2 C_2 = G_2 + P_2 (G_1 + P_1 G_0 + P_1 P_0 C_0)$
      $C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$
      Problem: In each full adder of a 4-bit carry look-ahead adder, there is a propagation delay of 4 ns before the carry propagate and carry generate outputs are valid. The delay in each external logic gate is 3 ns. Once all inputs to an adder are valid, there is a delay of 6 ns before the output sum bit is valid. What is the maximum number of additions per second that the adder can perform?
    • 1 addition = 4 ns + 3 ns + 3 ns + 6 ns = 16 ns
      (The AND gates operate in parallel and contribute one 3 ns delay; the OR gate follows in series and contributes the other.)
      Additions / sec $= 1 / (16 \times 10^{-9}) = 62.5 \times 10^{-3} \times 10^{9} = 62.5 \times 10^{6}$ = 62.5 million additions / sec
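The carry equations can be checked by evaluating the expanded generate/propagate form in software and comparing it with ordinary addition. This is only a sketch of the logic, not a hardware model: the function names and the bit-list representation (least significant bit first) are illustrative assumptions.

```python
def lookahead_carries(a_bits, b_bits, c0):
    """Return [C1, ..., Cn] computed from generate/propagate terms.

    a_bits and b_bits are lists of 0/1 values, least significant bit first.
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]   # G_i = A_i B_i
    p = [a | b for a, b in zip(a_bits, b_bits)]   # P_i = A_i + B_i
    carries = []
    for i in range(len(a_bits)):
        # Expanded form: C_{i+1} = G_i + P_i G_{i-1} + ... + P_i ... P_0 C_0
        carry = 0
        prefix = 1  # running product P_i P_{i-1} ... P_{j+1}
        for j in range(i, -1, -1):
            carry |= prefix & g[j]
            prefix &= p[j]
        carry |= prefix & c0
        carries.append(carry)
    return carries


def to_bits(value, width):
    """Bits of value, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


# The final carry must equal the carry-out of ordinary 4-bit addition.
for a in range(16):
    for b in range(16):
        carries = lookahead_carries(to_bits(a, 4), to_bits(b, 4), 0)
        assert carries[-1] == (a + b) >> 4

# Timing from the worked problem: G/P valid, AND level, OR level, sum bit.
add_time_ns = 4 + 3 + 3 + 6
rate = 1e9 / add_time_ns
print(f"{add_time_ns} ns per addition, {rate / 1e6:.1f} million additions per second")
```

Every carry depends only on the inputs and $C_0$, which is why the delay stays at two gate levels regardless of bit position, unlike the ripple-carry case.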
    • 3. Overlapping CPU & I/O Operations: DMA is conducted on a cycle-stealing basis.
         - The CDC-6600 has 10 I/O processors for I/O multiprocessing.
         - Simultaneous I/O operations and CPU computations can be achieved using separate I/O controllers and channels.
      4. Use of Hierarchical Memory System: a hierarchical memory system can be used to close the speed gap between the CPU and main memory, because the CPU is about 1000 times faster than memory access.
    • 5. Balancing of Subsystem Bandwidth: consider the relation $t_p < t_m < t_d$ (processor cycle time, memory cycle time and I/O device access time, respectively).
      Bandwidth of a system: the number of operations performed per unit time.
      Bandwidth of a memory: the number of words accessed per unit time, represented by $B_m$. If W is the total number of words accessed per memory cycle $t_m$, then
         $B_m = \dfrac{W}{t_m}$ (words / sec)
    • In the case of interleaved memory with M modules, memory access conflicts may delay some of the processor's requests. The utilized memory bandwidth will be
         $B_m^u = \dfrac{B_m}{\sqrt{M}}$ (words / sec)
      Processor bandwidth:
      $B_p$: maximum CPU computation rate.
      $B_p^u$: utilized processor bandwidth, i.e. the number of output results per second:
         $B_p^u = \dfrac{R_w}{T_p}$ (word results / sec)
    • $R_w$: number of word results.
      $T_p$: total CPU time to generate $R_w$ results.
      $B_d$: bandwidth of devices (assumed as provided by the vendor).
      The following relationships have been observed between the bandwidths of the major subsystems in a high performance uniprocessor:
         $B_m \ge B_m^u \ge B_p \ge B_p^u \ge B_d$
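The bandwidth definitions translate directly into code. In the sketch below the function names are illustrative, and every numeric value is hypothetical, chosen only so that the ordering between subsystems can be checked.

```python
import math


def memory_bandwidth(words_per_cycle, cycle_time_s):
    """B_m = W / t_m, in words per second."""
    return words_per_cycle / cycle_time_s


def utilized_memory_bandwidth(bm, modules):
    """B_m^u = B_m / sqrt(M) for M interleaved memory modules."""
    return bm / math.sqrt(modules)


def utilized_processor_bandwidth(word_results, cpu_time_s):
    """B_p^u = R_w / T_p, in word results per second."""
    return word_results / cpu_time_s


def follows_observed_ordering(bm, bm_u, bp, bp_u, bd):
    """Check B_m >= B_m^u >= B_p >= B_p^u >= B_d."""
    return bm >= bm_u >= bp >= bp_u >= bd


# Hypothetical machine: 16 words per 500 ns memory cycle, 16 modules.
bm = memory_bandwidth(words_per_cycle=16, cycle_time_s=500e-9)
bm_u = utilized_memory_bandwidth(bm, modules=16)
bp = 6e6                                                  # hypothetical peak CPU rate
bp_u = utilized_processor_bandwidth(word_results=4_000_000, cpu_time_s=1.0)
bd = 1e6                                                  # hypothetical device bandwidth
print(follows_observed_ordering(bm, bm_u, bp, bp_u, bd))
```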
    • Due to the unbalanced speeds, we need to match the processing power of the three subsystems. Two major approaches are described below:
      1. Bandwidth balancing between CPU and memory: use a fast cache with access time $t_c = t_p$.
      2. Bandwidth balancing between memory and I/O: intelligent disk controllers can be used to filter irrelevant data off the tracks, and buffering can be performed by I/O channels.
    • 6a. Multiprogramming: Some computer programs are CPU bound and some are I/O bound. Whenever a process P1 is tied up with I/O operations, the system scheduler can switch the CPU to process P2. This allows simultaneous execution of several programs in the system. This interleaving of CPU and I/O operations among several programs is called multiprogramming, and it reduces the total execution time.
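A toy model shows why interleaving reduces total execution time. Everything here is an assumption made for illustration and does not come from the lecture: each process is a list of (CPU burst, I/O burst) pairs, every process has its own I/O device, the CPU is not preempted during a burst, and the scheduler picks whichever unfinished process became ready first.

```python
def sequential_time(processes):
    """Total time when programs run one after another with no overlap."""
    return sum(cpu + io for bursts in processes for cpu, io in bursts)


def multiprogrammed_time(processes):
    """Total time when the CPU switches to another process during I/O."""
    n = len(processes)
    ready_at = [0] * n      # time at which each process can next use the CPU
    next_burst = [0] * n    # index of each process's next (cpu, io) pair
    cpu_free = 0            # time at which the CPU becomes idle
    finish = 0
    while True:
        pending = [i for i in range(n) if next_burst[i] < len(processes[i])]
        if not pending:
            return finish
        # First-come, first-served among the processes that still have work.
        i = min(pending, key=lambda idx: ready_at[idx])
        cpu, io = processes[i][next_burst[i]]
        start = max(cpu_free, ready_at[i])
        cpu_free = start + cpu            # the CPU is busy for the CPU burst
        ready_at[i] = cpu_free + io       # the I/O burst overlaps other work
        finish = max(finish, ready_at[i])
        next_burst[i] += 1


# Hypothetical workloads: P1 is I/O bound, P2 is CPU bound.
p1 = [(2, 4), (2, 0)]
p2 = [(3, 1), (1, 0)]
print(sequential_time([p1, p2]), multiprogrammed_time([p1, p2]))
```

In this example P2 uses the CPU while P1 waits for I/O, so the pair finishes sooner than running the two programs back to back.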
    • 6b. Time sharing: In multiprogramming, a high priority program may sometimes occupy the CPU for too long to allow others to share. This problem can be overcome by using a time-sharing operating system. The concept extends multiprogramming by assigning fixed or variable time slices to multiple programs. In other words, equal opportunities are given to all programs competing for the use of the CPU.
      Time sharing is particularly effective when applied to a computer system connected to