Risc processors all syllabus5
 


full RISC processors study material


An Introduction to RISC Processors

We are going to describe how microprocessor manufacturers took a new look at processor architectures in the 1980s and started designing simpler but faster processors. We begin by explaining why chip designers turned their backs on the conventional complex instruction set computer (CISC), such as the 68K and the Intel 80x86 families, and started producing reduced instruction set computers (RISCs) such as MIPS and the PowerPC. RISC processors have simpler instruction sets than CISC processors (although this is a rather crude distinction between these families, as we shall soon see).

By the mid 90s many of these so-called RISC processors were considerably more complex than some of the CISCs they replaced. That isn't a paradox. The RISC processor isn't really a cut-down computer architecture; it represents a new approach to architecture design. In fact, the distinction between CISC and RISC is now so blurred that virtually all processors have both RISC and CISC features.

The RISC Revolution

Before we look at the ARM, we describe the history and characteristics of RISC architecture. From the introduction of the microprocessor in the 1970s to the mid 1980s there seems to have been an almost unbroken trend towards more and more complex (you might even say Baroque) architectures. Some of these architectures developed rather like a snowball rolling downhill. Each advance in chip fabrication technology allowed designers to add more and more layers to the microprocessor's central core. Intel's 8086 family illustrates this trend particularly well, because Intel took their original 16-bit processor and added more features in each successive generation.
This approach to chip design leads to cumbersome architectures and inefficient instruction sets, but it has the tremendous commercial advantage that end users don't have to pay for new software when they buy the latest reincarnation of a microprocessor.

A reaction against the trend toward greater architectural complexity began at IBM with their 801 architecture and continued at Berkeley, where Patterson and Ditzel coined the term RISC to describe a new class of architectures that reversed earlier trends in microcomputer design. According to popular wisdom, RISC architectures are streamlined versions of traditional complex instruction set computers. This notion is
both misleading and dangerous, because it implies that RISC processors are in some way cruder versions of existing architectures. In brief, RISC architectures re-deploy to better effect some of the silicon real estate used to implement complex instructions and elaborate addressing modes in conventional microprocessors of the 68000 and 8086 generation. The mnemonic "RISC" should really stand for regular instruction set computer.

Two factors influencing the architecture of first- and second-generation microprocessors were microprogramming and the desire to help compiler writers by providing ever more complex instruction sets. The latter is called closing the semantic gap (i.e., reducing the difference between high-level and low-level languages). By complex instructions we mean instructions like MOVE 12(A3,D0),D2 and ADD (A6)-,D3 that carry out multi-step operations in a single machine-level instruction. The MOVE 12(A3,D0),D2 instruction generates an effective address by adding the contents of A3 to the contents of D0 plus the literal 12. The resulting address is used to access the source operand, which is loaded into register D2.

Microprogramming achieved its high point in the 1970s when ferrite core memory had a long access time of 1 µs or more and semiconductor high-speed random access memory was very expensive. Quite naturally, computer designers used the slow main store to hold the complex instructions that made up the machine-level program. These machine-level instructions are interpreted by microcode in the much faster microprogram control store within the CPU. Today, main stores use semiconductor memory with an access time of 50 ns or less, and most of the advantages of microprogramming have evaporated. Indeed, the goal of a RISC architecture is to execute an instruction in a single machine cycle. A corollary of this statement is that complex instructions can't be executed by RISC architectures.
Before we look at RISC architectures, we have to describe some of the research that led to the search for better architectures.

Instruction Usage

Computer scientists carried out extensive research over a decade or more in the late 1970s into the way in which computers execute programs. Their studies demonstrated that the relative frequency with which different classes of instructions are executed is not uniform and that some types of instruction are executed far more frequently than other types. Fairclough divided machine-level instructions into eight groups according to type and compiled the statistics shown in Table 1. The "mean value of instruction use" gives the percentage of times that instructions in that group are executed, averaged over both program types and computer architectures. These figures relate to early 8-bit processors.
Table 1 Instruction usage as a function of instruction type

Instruction group               1      2      3      4     5     6     7     8
Mean value of instruction use   45.28  28.73  10.75  5.92  3.91  2.93  2.05  0.4

The eight instruction groups in Table 1 are:
1. Data movement
2. Program flow control (i.e., branch, call, return)
3. Arithmetic
4. Compare
5. Logical
6. Shift
7. Bit manipulation
8. Input/output and miscellaneous

Table 1 convincingly demonstrates that the most common instruction type is the data movement primitive of the form P := Q in a high-level language or MOVE P,Q in a low-level language. Similarly, the program flow control group that includes both conditional and unconditional branches (together with subroutine calls and returns) forms the second most common group of instructions. Taken together, the data movement and program flow control groups account for 74% of all instructions. A corollary of this statement is that we can expect a large program to contain only 26% of instructions that are not data movement or program flow control primitives.

An inescapable inference from such results is that processor designers might be better employed devoting their time to optimizing the way in which machines handle instructions in groups one and two, than in seeking new powerful instructions that are seldom used. In the early days of the microprocessor, chip manufacturers went out of their way to provide special instructions that were unique to their products. These instructions were then heavily promoted by the company's sales force. Today, we can see that their efforts should have been directed towards the goal of optimizing the most frequently used instructions. RISC architectures have been designed to exploit the programming environment in which most instructions are data movement or program control instructions.

Another aspect of computer architecture that was investigated was the optimum size of literal operands (i.e., constants).
Tanenbaum reported the remarkable result that 56% of all constant values lie in the range -15 to +15 and that 98% of all constants lie in the range -511 to +511. Consequently, the inclusion of a 5-bit constant field in an instruction would cover over half the occurrences of a literal. RISC architectures have
sufficiently long instruction lengths to include a literal field as part of the instruction that caters for the majority of literals.

Programs use subroutines heavily, and an effective architecture should optimize the way in which subroutines are called, parameters passed to and from subroutines, and workspace allocated to local variables created by subroutines. Research showed that in 95% of cases twelve words of storage are sufficient for parameter passing and local storage. A computer with twelve registers should be able to handle all the operands required by most subroutines without accessing main store. Such an arrangement would reduce the processor-memory bus traffic associated with subroutine calls.

Characteristics of RISC Architectures

Having described the ingredients that go into an efficient architecture, we now look at the attributes of first-generation RISCs before covering RISC architectures in more detail. The characteristics of an efficient RISC architecture are:

1. RISC processors have sufficient on-chip registers to overcome the worst effects of the processor-memory bottleneck. Registers can be accessed more rapidly than off-chip main store. Although today's processors rely heavily on fast on-chip cache memory to increase throughput, registers still offer the highest performance.

2. RISC processors have three-address, register-to-register architectures with instructions in the form OPERATION Ra,Rb,Rc, where Ra, Rb, and Rc are general-purpose registers.

3. Because subroutine calls are so frequently executed, (some) RISC architectures make provision for the efficient passing of parameters between subroutines.

4. Instructions that modify the flow of control (e.g., branch instructions) are implemented efficiently because they comprise about 20 to 30% of a typical program.

5. RISC processors aim to execute one instruction per clock cycle. This goal imposes a limit on the maximum complexity of instructions.

6. RISC processors don't attempt to implement infrequently used instructions. Complex
instructions waste silicon real estate and conflict with the requirements of point 8. Moreover, the inclusion of complex instructions increases the time taken to design, fabricate and test a processor.
7. A corollary of point 5 is that an efficient architecture should not be microprogrammed, because microprogramming interprets a machine-level instruction by executing microinstructions. In the limit, a RISC processor is close to a microprogrammed architecture in which the distinction between machine cycle and microcode has vanished.

8. An efficient processor should have a single instruction format (or at least very few formats). A typical CISC processor such as the 68000 has variable-length instructions (e.g., from 2 to 10 bytes). By providing a single instruction format, the decoding of a RISC instruction into its component fields can be performed by a minimum level of decoding logic. It follows that a RISC's instruction length should be sufficient to accommodate the operation code field and one or more operand fields. Consequently, a RISC processor may not utilize memory space as efficiently as does a conventional CISC microprocessor.

Two fundamental aspects of the RISC architecture that we cover later are its register set and the use of pipelining. Multiple overlapping register windows were implemented by the Berkeley RISC to reduce the overhead incurred by transferring parameters between subroutines. Pipelining is a mechanism that permits the overlapping of instruction execution (i.e., internal operations are carried out in parallel). Many of the features of RISC processors are not new, and were employed long before the advent of the microprocessor. The RISC revolution happened when all these performance-enhancing techniques were brought together and applied to microprocessor design.

The Berkeley RISC

Although many CISC processors were designed by semiconductor manufacturers, one of the first RISC processors came from the University of California at Berkeley. The Berkeley RISC wasn't a commercial machine, although it had a tremendous impact on the development of later RISC architectures. Figure 1 describes the format of a Berkeley RISC instruction.
Each of the 5-bit operand fields (Destination, Source 1, Source 2) permits one of 32 internal registers to be accessed.

Figure 1 Format of the Berkeley RISC instruction
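The field layout of Figure 1 can be sketched as a pair of encode and decode helpers. This is only an illustration under stated assumptions: the text fixes the destination at bits 19 to 23, the IM bit at bit 13, and a 13-bit literal, while the positions used here for the opcode (bits 25 to 31), Scc (bit 24), and Source 1 (bits 14 to 18) are inferred from the field widths. Placing a register number in the low five bits of Source 2, and treating the literal as an unsigned bit pattern, are also assumptions.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    opcode: int  # bits 31..25 (assumed position, 7 bits left over)
    scc: int     # bit 24 (assumed position), set-condition-code flag
    dest: int    # bits 23..19, destination register
    src1: int    # bits 18..14 (assumed position), first source register
    imm: int     # bit 13, 1 = Source 2 is a literal, 0 = a register
    src2: int    # 13-bit literal, or register number in the low 5 bits


def encode(opcode: int, scc: int, dest: int, src1: int, imm: int, src2: int) -> int:
    """Pack the fields into one 32-bit instruction word."""
    return ((opcode & 0x7F) << 25 | (scc & 1) << 24 | (dest & 0x1F) << 19
            | (src1 & 0x1F) << 14 | (imm & 1) << 13 | (src2 & 0x1FFF))


def decode(word: int) -> Fields:
    """Split a 32-bit instruction word into its fields."""
    word &= 0xFFFFFFFF
    imm = (word >> 13) & 1
    low13 = word & 0x1FFF
    # With IM = 0 only five bits are needed to name one of 32 registers.
    src2 = low13 if imm else low13 & 0x1F
    return Fields((word >> 25) & 0x7F, (word >> 24) & 1,
                  (word >> 19) & 0x1F, (word >> 14) & 0x1F, imm, src2)
```

Because every field sits at a fixed position, decoding needs nothing more than shifts and masks, which is the software analogue of the minimal decoding logic that a single instruction format allows.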
The single-bit set condition code field, Scc, determines whether the condition code bits are updated after the execution of an instruction. The 14-bit Source 2 field has two functions. If the IM (immediate) bit is 0, the Source 2 field specifies one of 32 registers. If the IM bit is 1, the Source 2 field provides a 13-bit literal operand.

Since five bits are allocated to each operand field, it follows that this RISC has 2^5 = 32 internal registers. This last statement is emphatically not true, since the Berkeley RISC has 138 user-accessible general-purpose internal registers. The discrepancy between the number of registers directly addressable and the actual number of registers is due to a mechanism called windowing that gives the programmer a view of only a subset of all registers at any instant. Register R0 is hardwired to contain the constant zero. Specifying R0 as an operand is the same as specifying the constant 0.

Register Windows

An important feature of the Berkeley RISC architecture is the way in which it allocates new registers to subroutines; that is, when you call a subroutine, you get some new registers. If you can create 12 registers out of thin air when you call a subroutine, each subroutine will have its own workspace for temporary variables, thereby avoiding relatively slow accesses to main store.
Although only 12 or so registers are required by each invocation of a subroutine, the successive nesting of subroutines rapidly increases the total number of on-chip registers assigned to subroutines. You might think that any attempt to dedicate a set of registers to each new procedure is impractical, because the repeated calling of nested subroutines will require an unlimited amount of storage. Subroutines can indeed be nested to any depth, but research has demonstrated that on average subroutines are not nested to any great depth over short periods. Consequently, it is feasible to adopt a modest number of local register sets for a sequence of nested subroutines.

Figure 2 provides a graphical representation of the execution of a typical program in terms of the depth of nesting of subroutines as a function of time. The trace goes up each time a subroutine is called and down each time a return is made. If subroutines were never called, the trace would be a horizontal line. This figure demonstrates that even though subroutines may be nested to considerable depths, there are periods or runs of subroutine calls and returns that do not require a nesting level of greater than about five.

Figure 2 Depth of subroutine nesting as a function of time
A mechanism for implementing local variable work space for subroutines adopted by the designers of the Berkeley RISC is to support up to eight nested subroutines by providing on-chip work space for each subroutine. Any further nesting forces the CPU to dump registers to main memory, as we shall soon see.

Memory space used by subroutines can be divided into four types:

Global space: Global space is directly accessible by all subroutines and holds constants and data that may be required from any point within the program. Most conventional microprocessors have only global registers.
Local space: Local space is private to the subroutine. That is, no other subroutine can access the current subroutine's local address space from outside the subroutine. Local space is employed as working space by the current subroutine.

Imported parameter space: Imported parameter space holds the parameters imported by the current subroutine from the parent that called it. In Berkeley RISC terminology these are called the high registers.

Exported parameter space: Exported parameter space holds the parameters exported by the current subroutine to its child. In RISC terminology these are called the low registers.

Windows and Parameter Passing

One of the reasons for the high frequency of data movement operations is the need to pass parameters to subroutines and to receive them from subroutines.

The Berkeley RISC architecture deals with parameter passing by means of multiple overlapped windows. A window is the set of registers visible to the current subroutine. Figure 3 illustrates the structure of the Berkeley RISC's overlapping windows. Only three consecutive windows (i-1, i, i+1) of the 8 windows are shown in Figure 3. The vertical columns represent the registers seen by the corresponding window. Each window sees 32 registers, but they aren't all the same 32 registers.

The Berkeley RISC has a special-purpose register called the window pointer, WP, that indicates the current active window. Suppose that the processor is currently using the ith window set. In this case the WP contains the value i. The registers in each of
The registers in each ofthe 8 windows are divided into four groups shown in Table 2.Table 2 Berkeley RISC register typesRegister name Register typeR0 to R9 The global register set is always accessible.R10 to R15 Six registers used by the subroutine to receive parameters from its parent an parent.R16 to R25 Ten local registers accessed only by the current subroutine that cannot be ac subroutine.R26 to R31 Six registers used by the subroutine to pass parameters to and from its own called by itself).
All windows consist of 32 addressable registers, R0 to R31. A Berkeley RISC instruction of the form ADD R3,R12,R25 implements [R25] ← [R3] + [R12], where R3 lies within the window's global address space, R12 lies within its import from (or export to) parent subroutine space, and R25 lies within its local address space. RISC arithmetic and logical instructions always involve 32-bit values (there are no 8-bit or 16-bit operations).

The Berkeley RISC's subroutine call is CALLR Rd,<address> and is similar to a typical CISC instruction BSR <address>. Whenever a subroutine is invoked by CALLR Rd,<address>, the contents of the window pointer are incremented by 1 and the current value of the program counter is saved in register Rd of the new window. The Berkeley RISC doesn't employ a conventional stack in external main memory to save subroutine return addresses.

Figure 3 Berkeley windowed register sets
Once a new window has been invoked (in Figure 3 this is window i), the new subroutine sees a different set of registers to the previous window. Global registers R0 to R9 are an exception because they are common to all windows. Register R10 of the child (i.e., called) subroutine corresponds to (i.e., is the same as) register R26 of the calling (i.e., parent) subroutine. Suppose you wish to send a parameter to a subroutine. If the parameter is in R26 and you call a subroutine, register R10 in this subroutine will contain the parameter. There hasn't been a physical transfer of data because register R10 in the current window is simply register R26 in the previous window.

Figure 4 Relationship between register number, window number, and register address
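The relationship between window number, register number, and physical register address can be sketched in a few lines. The layout below is an assumption, not a transcription of Figure 4: it places the ten globals at locations 0 to 9, gives each window 16 fresh locations (six parent-facing plus ten local), and lets R26 to R31 of one window fall onto R10 to R15 of the next, wrapping from window 7 back to window 0. It reproduces the 138-register total and location 137 for R25 in window 7; for R25 in window 3 it yields 73, so the figure's own layout may differ slightly.

```python
NUM_WINDOWS = 8
GLOBALS = 10       # R0..R9, shared by every window
PER_WINDOW = 16    # 6 parent-facing (R10..R15) + 10 local (R16..R25)
TOTAL_REGISTERS = GLOBALS + NUM_WINDOWS * PER_WINDOW  # 10 + 8*10 + 8*6


def physical_register(window: int, reg: int) -> int:
    """Map (window pointer, register number) to a location in the register file.

    Assumed linear layout; R26..R31 of window w alias R10..R15 of window w+1.
    """
    if not 0 <= window < NUM_WINDOWS:
        raise ValueError("window must be in 0..7")
    if not 0 <= reg <= 31:
        raise ValueError("register must be in 0..31")
    if reg < GLOBALS:
        return reg  # globals ignore the window pointer
    offset = (window * PER_WINDOW + (reg - GLOBALS)) % (NUM_WINDOWS * PER_WINDOW)
    return GLOBALS + offset


if __name__ == "__main__":
    print(TOTAL_REGISTERS)
    print(physical_register(7, 25))
    # A parameter left in the parent's R26 is the child's R10: same location.
    print(physical_register(3, 26) == physical_register(4, 10))
```

The modulo operation is what makes the windows circular, so that the export registers of window 7 are the import registers of window 0.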
The physical arrangement of the Berkeley RISC's window system is given in Figure 4. On the left-hand side of the diagram is the actual register array that holds all the on-chip general-purpose registers. The eight columns associated with windows 0 to 7 demonstrate how each window is mapped onto the physical memory array on the chip and how the overlapping regions are organized. The windows are logically arranged in a circular fashion so that window 0 follows window 7 and window 7 precedes window 0. For example, if the current window pointer is 3 and you access register R25, location 74 is accessed in the register file. However, if you access register R25 when the window pointer is 7, you access location 137.

The total number of physical registers required to implement the Berkeley windowed register set is:

10 global + 8 x 10 local + 8 x 6 parameter transfer registers = 138 registers.

Window Overflow

Unfortunately, the total quantity of on-chip resources of any processor is finite and, in the case of the Berkeley RISC, the registers are limited to 8 windows. If subroutines are nested to a depth greater than or equal to 7, window overflow is said to occur, as there is no longer a new window for the next subroutine invocation. When an overflow takes place, the only thing left to do is to employ external memory to hold the overflow data. In practice the oldest window is saved rather than the new window created by the subroutine just called.

If the number of subroutine returns minus the number of subroutine calls exceeds 8, window underflow takes place. Window underflow is the converse of window overflow, and the youngest window saved in main store must be returned to a window. A considerable amount of research was carried out into dealing with window overflow efficiently. However, the imaginative use of windowed register sets in the Berkeley RISC was not adopted by many of the later RISC architectures.
Modern RISC processors generally have a single set of 32 general-purpose registers.

RISC Architecture and Pipelining

We now describe pipelining, one of the most important techniques for increasing the throughput of a digital system, which uses the regular structure of a RISC to carry out internal operations in parallel.
Figure 5 illustrates the machine cycle of a hypothetical microprocessor executing an ADD P instruction (i.e., [A] ← [A] + [M(P)]), where A is an on-chip general-purpose register and P is a memory location. The instruction is executed in five phases:

Instruction fetch: Read the instruction from the system memory and increment the program counter.

Instruction decode: Decode the instruction read from memory during the previous phase. The nature of the instruction decode phase is dependent on the complexity of the instruction encoding. A regularly encoded instruction might be decoded in a few nanoseconds with two levels of gating, whereas a complex instruction format might require ROM-based look-up tables to implement the decoding.

Operand fetch: The operand specified by the instruction is read from the system memory or an on-chip register and loaded into the CPU.

Execute: The operation specified by the instruction is carried out.

Operand store: The result obtained during the execution phase is written into the operand destination. This may be an on-chip register or a location in external memory.

Figure 5 Instruction execution

Each of these five phases may take a specific time (although the time taken would normally be an integer multiple of the system's master clock period). Some instructions require fewer than five phases; for example, CMP R1,R2 compares R1 and R2 by subtracting R1 from R2 to set the condition codes and does not need an operand store phase.

The inefficiency in the arrangement of Figure 5 is immediately apparent. Consider the execution phase of instruction interpretation. This phase might take one fifth of an instruction cycle, leaving the instruction execution unit idle for the remaining 80% of the time. The same rule applies to the other functional units of the processor, which also lie idle for 80% of the time. A technique called instruction pipelining can be employed to increase the effective speed of the processor by overlapping in time the
various stages in the execution of an instruction. In the simplest of terms, a pipelined processor executes instruction i while fetching instruction i + 1 at the same time.

The way in which a RISC processor implements pipelining is described in Figure 6. The RISC processor executes the instruction in four steps or phases: instruction fetch from external memory, operand fetch, execute, and operand store (we're using a 4-stage system because a separate "instruction decode" phase isn't normally necessary). The internal phases take approximately the same time as the instruction fetch, because these operations take place within the CPU itself and operands are fetched from and stored in the CPU's own register file. Instruction 1 in Figure 6 begins in time slot 1 and is completed at the end of time slot 4.

Figure 6 Pipelining and instruction overlap
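The overlap in Figure 6 can be tabulated with a short sketch. It assumes an ideal four-stage pipeline in which every stage takes exactly one time slot and nothing ever stalls; the stage abbreviations (IF, OF, EX, OS) and function names are ours, chosen for illustration.

```python
STAGES = ["IF", "OF", "EX", "OS"]  # instruction fetch, operand fetch, execute, operand store


def schedule(n_instructions: int, stages=STAGES) -> dict:
    """Return {time_slot: {stage: instruction_number}}, both numbered from 1.

    Instruction i enters the first stage in slot i and moves on one stage per slot.
    """
    table = {}
    for instr in range(1, n_instructions + 1):
        for position, stage in enumerate(stages):
            table.setdefault(instr + position, {})[stage] = instr
    return table


def pipelined_slots(n_instructions: int, n_stages: int) -> int:
    """Slots needed when stages overlap: fill the pipe once, then one result per slot."""
    return n_stages + n_instructions - 1


def unpipelined_slots(n_instructions: int, n_stages: int) -> int:
    """Slots needed when each instruction finishes before the next one starts."""
    return n_instructions * n_stages


if __name__ == "__main__":
    for slot, busy in sorted(schedule(5).items()):
        print(slot, busy)
    print(unpipelined_slots(100, 4) / pipelined_slots(100, 4))
```

For a long run of instructions the ratio of the two slot counts approaches the number of stages, which is why a four-stage pipeline can deliver at most a fourfold gain in throughput.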
In a non-pipelined processor, the next instruction doesn't begin until the current instruction has been completed. In the pipelined system of Figure 6, the instruction fetch phase of instruction 2 begins in time slot 2, at the same time that the operand is being fetched for instruction 1. In time slot 3, different phases of instructions 1, 2, and 3 are being executed simultaneously. In time slot 4, all functional units of the system are operating in parallel and an instruction is completed in every time slot thereafter. An n-stage pipeline can increase throughput by up to a factor of n.

Pipeline Bubbles

A pipeline is an ordered structure that thrives on regularity. At any stage in the execution of a program, a pipeline contains components of two or more instructions at varying stages in their execution. Consider Figure 7, in which a sequence of instructions is being executed in a 4-stage pipelined processor. When the processor encounters a branch instruction, the following instruction is no longer found at the next sequential address but at the target address in the branch instruction. The processor is forced to reload its program counter with the value provided by the branch instruction. This means that all the useful work performed by the pipeline must now be thrown away, since the instructions immediately following the branch are not going to be executed.

When information in a pipeline is rejected or the pipeline is held up by the introduction of idle states, we say that a bubble has been introduced.

Figure 7 The pipeline bubble caused by a branch
As we have already stated, program control instructions are very frequent. Consequently, any realistic processor using pipelining must do something to overcome the problem of bubbles caused by instructions that modify the flow of control (branch, subroutine call and return). The Berkeley RISC reduces the effect of bubbles by refusing to throw away the instruction following a branch. This mechanism is called a delayed jump or a branch-and-execute technique because the instruction immediately after a branch is always executed. Consider the effect of the following sequence of instructions:

    ADD  R1,R2,R3    [R3] ← [R1] + [R2]
    JMPX N           [PC] ← [N]            Goto address N
    ADD  R2,R4,R5    [R5] ← [R2] + [R4]    This is executed
    ADD  R7,R8,R9                          Not executed because the branch is taken

The processor calculates R5 := R2 + R4 before executing the branch. This sequence of instructions is most strange to the eyes of a conventional assembly language programmer, who is not accustomed to seeing an instruction executed after a branch has been taken.
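The effect of the delayed jump on the sequence above can be reproduced with a toy interpreter. This is a sketch under simplifying assumptions: instructions are Python tuples rather than machine words, the jump target is an index into the program list, the target N is taken to lie just past the listed instructions, and the delay slot never holds another branch. The register values are arbitrary test data.

```python
def run(program, regs, delayed=True):
    """Execute ADD/JMPX tuples; return the indices of the instructions executed.

    With delayed=True the instruction after a JMPX still runs before the jump
    takes effect; with delayed=False the jump takes effect immediately.
    """
    executed = []
    pc = 0
    pending_target = None  # jump waiting for its delay slot to finish
    while 0 <= pc < len(program):
        op = program[pc]
        executed.append(pc)
        if pending_target is not None:
            next_pc, pending_target = pending_target, None
        else:
            next_pc = pc + 1
        if op[0] == "ADD":
            _, src_a, src_b, dest = op
            regs[dest] = regs[src_a] + regs[src_b]
        elif op[0] == "JMPX":
            if delayed:
                pending_target = op[1]
            else:
                next_pc = op[1]
        else:
            raise ValueError(f"unknown operation {op[0]}")
        pc = next_pc
    return executed


PROGRAM = [
    ("ADD", "R1", "R2", "R3"),
    ("JMPX", 4),                # hypothetical target: just past the end
    ("ADD", "R2", "R4", "R5"),  # delay slot
    ("ADD", "R7", "R8", "R9"),  # skipped when the branch is taken
]


def fresh_regs():
    return {"R1": 1, "R2": 2, "R3": 0, "R4": 4, "R5": 0, "R7": 7, "R8": 8, "R9": 0}
```

Running PROGRAM with delayed=True executes the first three instructions and updates R5, whereas delayed=False stops after the jump and leaves R5 untouched; in both cases R9 is never written.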
Unfortunately, it's not always possible to arrange a program in such a way as to include a useful instruction immediately after a branch. Whenever this happens, the compiler must introduce a no operation instruction, NOP, after the branch and accept the inevitability of a bubble. Figure 8 demonstrates how a RISC processor implements a delayed jump. The branch described in Figure 8 is a computed branch whose target address is calculated during the execute phase of the instruction cycle.

Figure 8 Delayed branch

Another problem caused by pipelining is data dependency, in which certain sequences of instructions run into trouble because the current operation requires a result from the previous operation and the previous operation has not yet left the pipeline. Figure 9 demonstrates how data dependency occurs.

Figure 9 Data dependency
Suppose a programmer wishes to carry out the apparently harmless calculation X := (A + B) AND (A + B - C).

Assuming that A, B, C, X, and two temporary values, T1 and T2, are in registers in the current window, we can write:

    ADD A,B,T1     [T1] ← [A] + [B]
    SUB T1,C,T2    [T2] ← [T1] - [C]
    AND T1,T2,X    [X] ← [T1] · [T2]

Instruction i + 1 in Figure 9 begins execution during the operand fetch phase of the previous instruction. However, instruction i + 1 cannot continue on to its operand fetch phase, because the very operand it requires does not get written back to the register file for another two clock cycles. Consequently a bubble must be introduced in the pipeline while instruction i + 1 waits for its data. In a similar fashion, the logical AND operation also introduces a bubble as it too requires the result of a previous operation which is in the pipeline.

Figure 10 demonstrates a technique called internal forwarding designed to overcome the effects of data dependency. The following sequence of operations is to be executed.

    1. ADD R1,R2,R3    [R3] ← [R1] + [R2]
    2. ADD R4,R5,R6    [R6] ← [R4] + [R5]
    3. ADD R3,R4,R7    [R7] ← [R3] + [R4]
    4. ADD R7,R1,R8    [R8] ← [R7] + [R1]

Figure 10 Internal forwarding
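Both instruction sequences above can be checked for data dependency mechanically. The sketch below assumes the simplest possible rule for the four-stage pipeline: a source operand is at risk only when it was produced by the immediately preceding instruction, so one intervening instruction is taken to be enough for the result to reach the register file. Each flagged pair marks a place where the processor must either insert a bubble or forward the result internally. The tuple format and function name are ours.

```python
def adjacent_dependencies(program):
    """Return (index, register) pairs where an instruction reads the register
    written by the instruction directly before it.

    program is a list of (operation, source1, source2, destination) tuples;
    indices count from 0.
    """
    found = []
    for i in range(1, len(program)):
        previous_dest = program[i - 1][3]
        for source in program[i][1:3]:
            if source == previous_dest:
                found.append((i, source))
    return found


LOGIC_EXAMPLE = [
    ("ADD", "A", "B", "T1"),
    ("SUB", "T1", "C", "T2"),
    ("AND", "T1", "T2", "X"),
]

FORWARDING_EXAMPLE = [
    ("ADD", "R1", "R2", "R3"),
    ("ADD", "R4", "R5", "R6"),
    ("ADD", "R3", "R4", "R7"),
    ("ADD", "R7", "R1", "R8"),
]

if __name__ == "__main__":
    print(adjacent_dependencies(LOGIC_EXAMPLE))
    print(adjacent_dependencies(FORWARDING_EXAMPLE))
```

In the first sequence both SUB and AND depend on their immediate predecessor, whereas in the second only the final ADD does, since the use of R3 is separated from its producer by an unrelated instruction.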
In this example, instruction 3 (i.e., ADD R3,R4,R7) uses an operand generated by instruction 1 (i.e., the contents of register R3). Because of the intervening instruction 2, the destination operand generated by instruction 1 has time to be written into the register file before it is read as a source operand by instruction 3.

Instruction 3 generates a destination operand R7 that is required as a source operand by the next instruction. If the processor were to read the source operand requested by instruction 4 from the register file, it would see the old value of R7. By means of internal forwarding the processor transfers R7 from instruction 3's execution unit directly to the execution unit of instruction 4 (see Figure 10).

Accessing External Memory in RISC Systems

Conventional CISC processors have a wealth of addressing modes that are used in conjunction with memory reference instructions. For example, the 68020 implements ADD D0,-(A5), which adds the contents of D0 to the top of the stack pointed at by A5 and then pushes the result on to this stack.

In their ruthless pursuit of efficiency, the designers of the Berkeley RISC severely restricted the way in which it accesses external memory. The Berkeley RISC permits only two types of reference to external memory: a load and a store. All arithmetic and
logical operations carried out by the RISC apply only to source and destination operands in registers. Similarly, the Berkeley RISC provides a limited number of addressing modes with which to access an operand in the main store. It's not hard to find the reason for these restrictions on external memory accesses: an external memory reference takes longer than an internal operation. We now discuss some of the general principles of Berkeley RISC load and store instructions.

Consider the load register operation of the form LDXW (Rx)S2,Rd that has the effect [Rd] ← [M([Rx] + S2)]. The operand address is the contents of register Rx plus the offset S2. Figure 11 demonstrates the sequence of actions performed during the execution of this instruction. During the source fetch phase, register Rx is read from the register file and used to calculate the effective address of the operand in the execute phase. However, the processor can't progress beyond the execute phase to the store operand phase, because the operand hasn't been read from the main store. Therefore the main store must be accessed to read the operand and a store operand phase executed to load the operand into destination register Rd. Because memory accesses introduce bubbles into the pipeline, they are avoided wherever possible.

Figure 11 The load operation
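The effect of LDXW (Rx)S2,Rd can be sketched with a small register-and-memory model. The class and method names are ours, memory is modelled as a sparse dictionary of word values, the offset is treated as an unsigned quantity, and register windows are ignored; the only architectural facts relied on are the effective address [Rx] + S2 and the hardwired zero in R0.

```python
class Machine:
    """Toy model: 32 visible registers and a sparse word-addressed memory."""

    def __init__(self):
        self.regs = [0] * 32
        self.memory = {}

    def read_reg(self, n: int) -> int:
        return 0 if n == 0 else self.regs[n]  # R0 always reads as zero

    def write_reg(self, n: int, value: int) -> None:
        if n != 0:                            # writes to R0 are discarded
            self.regs[n] = value & 0xFFFFFFFF

    def ldxw(self, rx: int, s2: int, rd: int, s2_is_register: bool = False) -> int:
        """[Rd] <- [M([Rx] + S2)]; returns the effective address that was used.

        s2 is a register number when s2_is_register is True, otherwise a literal.
        """
        offset = self.read_reg(s2) if s2_is_register else s2
        address = (self.read_reg(rx) + offset) & 0xFFFFFFFF
        self.write_reg(rd, self.memory.get(address, 0))
        return address


if __name__ == "__main__":
    m = Machine()
    m.memory[123] = 77
    m.memory[200] = 5
    m.write_reg(12, 200)
    print(m.ldxw(0, 123, 3), m.read_reg(3))                      # Rx = R0: absolute address
    print(m.ldxw(12, 0, 4, s2_is_register=True), m.read_reg(4))  # S2 = R0: register indirect
```

Choosing R0 for either the base or the offset removes that term from the sum, which is how a single indexed mode can stand in for simpler addressing modes.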
The Berkeley RISC implements two basic addressing modes: indexed and program counter relative. All other addressing modes can (and must) be synthesized from these two primitives. The effective address in the indexed mode is given by:

    EA = [Rx] + S2

where Rx is the index register (one of the 32 general-purpose registers accessible by the current subroutine) and S2 is an offset. The offset can be either a general-purpose register or a 13-bit constant.

The effective address in the program counter relative mode is given by:

    EA = [PC] + S2

where PC represents the contents of the program counter and S2 is an offset as above.

These addressing modes include quite a powerful toolbox: zero, one or two pointers and a constant offset. If you wonder how we can use an addressing mode without an index (i.e., pointer) register, remember that R0 in the global register set permanently contains the constant 0. For example, LDXW (R12)R0,R3 uses simple address register indirect addressing, whereas LDXW (R0)123,R3 uses absolute addressing (i.e., memory location 123).

There's a difference between the addressing modes permitted by load and store operations. A load instruction permits the second source, S2, to be either an immediate value or a second register, whereas a store instruction permits S2 to be a 13-bit immediate value only. This lack of symmetry between the load and store addressing modes is because a "load base+index" instruction requires a register file with two ports, whereas a "store base+index" instruction requires a register file with three ports. Two-ported memory allows two simultaneous accesses. Three-ported memory allows three simultaneous accesses and is harder to design.

Figure 1 defines just two basic Berkeley RISC instruction formats. The short immediate format provides a 5-bit destination, a 5-bit source 1 operand and a 14-bit short source 2 operand.
      The short immediate format has two variations: one that specifies a 13-bit literal for source 2 and one that specifies a 5-bit source 2 register address. Bit 13 specifies whether the source 2 operand is a 13-bit literal or a 5-bit register pointer.

      The long immediate format provides a 19-bit source operand by concatenating the two source operand fields. Thirteen-bit and 19-bit immediate fields may sound a little strange at first sight. However, since 13 + 19 = 32, the Berkeley RISC permits a full
    • 32-bit value to be loaded into a window register in two operations. In the next section we will discover that the ARM processor deals with literals in a different way. A typical CISC microprocessor might take the same number of instruction bits to perform the same action (i.e., a 32-bit operation code field followed by a 32-bit literal).

      The following describes some of the addressing modes that can be synthesized from the RISC's basic addressing modes.

      1. Absolute addressing
         EA = 13-bit offset
         Implemented by setting Rx = R0 = 0, S2 = 13-bit constant.
      2. Register indirect
         EA = [Rx]
         Implemented by setting S2 = R0 = 0.
      3. Indexed addressing
         EA = [Rx] + Offset
         Implemented by setting S2 = 13-bit constant.
      4. Two-dimensional byte addressing (i.e., byte array access)
         EA = [Rx] + [Ry]
         Implemented by setting S2 = [Ry].
         This mode is available only for load instructions.

      Conditional instructions (i.e., branch operations) do not require a destination register and therefore the five bits, 19 to 23, normally used to specify a destination register are used to specify the condition (one of 16 since bit 23 is not used by conditional instructions).

      Reducing the Branch Penalty

      If we're going to reduce the effect of branches on the performance of RISC processors, we need to determine the effect of branch instructions on the performance of the system. Because we cannot know how many branches a given program will contain, or how likely each branch is to be taken, we have to construct a probabilistic model to describe the system's performance. We will make the following assumptions:

      1. Each non-branch instruction is executed in one cycle
      2. The probability that a given instruction is a branch is pb
    • 3. The probability that a branch instruction will be taken is pt
      4. If a branch is taken, the additional penalty is b cycles. If a branch is not taken, there is no penalty

      If pb is the probability that an instruction is a branch, 1 - pb is the probability that it is not a branch.

      The average number of cycles executed during the execution of a program is the sum of the cycles taken for non-branch instructions, plus the cycles taken by branch instructions that are taken, plus the cycles taken by branch instructions that are not taken. We can derive an expression for the average number of cycles per instruction as:

      Tave = (1 - pb) × 1 + pb × pt × (1 + b) + pb × (1 - pt) × 1 = 1 + pb × pt × b.

      This expression, 1 + pb × pt × b, tells us that the proportion of branch instructions, the probability that a branch is taken, and the overhead per taken branch all contribute to the branch penalty. We are now going to examine some of the ways in which the value of pb × pt × b can be reduced.

      Branch Prediction

      If we can predict the outcome of the branch instruction before it is executed, we can start filling the pipeline with instructions from the branch target address (assuming the branch is going to be taken). For example, if the instruction is BRA N, the processor can start fetching instructions at locations N, N + 1, N + 2 etc., as soon as the branch instruction is fetched from memory. In this way, the pipeline is always filled with useful instructions.

      This prediction mechanism works well with an unconditional branch like BRA N. Unfortunately, conditional branches pose a problem. Consider a conditional branch of the form BCC N (branch to N on carry bit clear). Should the RISC processor make the assumption that the branch will not be taken and fetch instructions in sequence, or should it make the assumption that the branch will be taken and fetch instructions at the branch target address N?

      As we have already said, conditional branches are required to implement various types of high-level language construct. 
      Consider the following fragment of high-level language code.

      if (J < K) I = I + L;
      for (T = 1; T <= I; T++)
    • {..}

      The first conditional operation compares J with K. Only the nature of the problem will tell us whether J is often less than K.

      The second conditional in this fragment of code is provided by the FOR construct that tests a counter at the end of the loop and then decides whether to jump back to the body of the construct or to terminate the loop. In this case, you could bet that the loop is more likely to be repeated than exited. Loops can be executed thousands of times before they are exited. Some computers look at the type of conditional branch and then either fill the pipeline from the branch target if the branch is expected to be taken, or fill the pipeline from the instruction after the branch if it is expected not to be taken.

      If we attempt to predict the behavior of a system with two outcomes (branch taken or branch not taken), there are four possibilities:

      1. Predict branch taken and branch taken: successful outcome
      2. Predict branch taken and branch not taken: unsuccessful outcome
      3. Predict branch not taken and branch not taken: successful outcome
      4. Predict branch not taken and branch taken: unsuccessful outcome

      Suppose we apply a branch penalty to each of these four possible outcomes. The penalty is the number of cycles taken by that particular outcome, as table 3 demonstrates. For example, if we think that a branch will not be taken and get instructions following the branch and the branch is actually taken (forcing the pipeline to be loaded with instructions at the target address), the branch penalty in table 3 is c cycles.

      Table 3 The branch penalty

      Prediction          Result              Branch penalty
      Branch taken        Branch taken        a
      Branch taken        Branch not taken    b
      Branch not taken    Branch taken        c
      Branch not taken    Branch not taken    d

      We can now calculate the average penalty for a particular system. To do this we need more information about the system. The first thing we need to know is the probability
    • that an instruction will be a branch (as opposed to any other category of instruction). Assume that the probability that an instruction is a branch is pb. The next thing we need to know is the probability that the branch instruction will be taken, pt. Finally, we need to know the accuracy of the prediction. Let pc be the probability that a branch prediction is correct. These values can be obtained by observing the performance of real programs. Figure 12 illustrates all the possible outcomes of an instruction. We can immediately write:

      (1 - pb) = probability that an instruction is not a branch.
      (1 - pt) = probability that a branch will not be taken.
      (1 - pc) = probability that a prediction is incorrect.

      These equations are obtained by using the principle that if one event or another must take place, their probabilities must add up to unity. The average branch penalty per branch instruction is therefore

      Cave = a × p(branch predicted taken and taken) + b × p(branch predicted taken but not taken) + c × p(branch predicted not taken but taken) + d × p(branch predicted not taken and not taken)

      Cave = a × (pt × pc) + b × (1 - pt) × (1 - pc) + c × pt × (1 - pc) + d × (1 - pt) × pc

      Figure 12 Branch prediction
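      As a quick numerical check of the expression for Cave, the following Python sketch evaluates the four-outcome sum directly. The function name and the example values of a, b, c, d, pt and pc are assumptions chosen purely for illustration; they are not measurements from any real processor.

```python
def average_branch_penalty(a, b, c, d, pt, pc):
    """Average penalty per branch instruction (Cave), in cycles.

    a, b, c, d are the penalties from table 3; pt is the probability that a
    branch is taken and pc the probability that the prediction is correct.
    """
    p_predicted_taken_and_taken = pt * pc
    p_predicted_taken_but_not_taken = (1 - pt) * (1 - pc)
    p_predicted_not_taken_but_taken = pt * (1 - pc)
    p_predicted_not_taken_and_not_taken = (1 - pt) * pc

    # The four outcomes are exhaustive, so their probabilities sum to unity.
    total = (p_predicted_taken_and_taken + p_predicted_taken_but_not_taken
             + p_predicted_not_taken_but_taken
             + p_predicted_not_taken_and_not_taken)
    assert abs(total - 1.0) < 1e-12

    return (a * p_predicted_taken_and_taken
            + b * p_predicted_taken_but_not_taken
            + c * p_predicted_not_taken_but_taken
            + d * p_predicted_not_taken_and_not_taken)


# Hypothetical values: correct predictions cost 1 cycle, wrong ones 3 cycles.
cave = average_branch_penalty(a=1, b=3, c=3, d=1, pt=0.6, pc=0.9)
print(round(cave, 3))  # 1.2
```

      With these illustrative numbers, raising pc towards 1 pulls Cave towards the cost of a correct prediction, which is the motivation for better predictors.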
    • The average number of cycles added due to a branch instruction is Cave × pb

      = pb × (a × pt × pc + b × (1 - pt) × (1 - pc) + c × pt × (1 - pc) + d × (1 - pt) × pc).

      We can make two assumptions to help us to simplify this general expression. The first is that a = d = N (i.e., if the prediction is correct the number of cycles is N). The other simplification is that b = c = B (i.e., if the prediction is wrong the number of cycles is B). The average number of cycles added per instruction by branches is therefore:

      pb × (N × pt × pc + B × pt × (1 - pc) + B × (1 - pt) × (1 - pc) + N × (1 - pt) × pc) = pb × (N × pc + B × (1 - pc)).

      This formula can be used to investigate tradeoffs between branch penalties, branch probabilities and pipeline length. There are several ways of implementing branch prediction (i.e., increasing the value of pc). Two basic approaches are static branch
    • prediction and dynamic branch prediction. Static branch prediction makes the assumption that branches are always taken or never taken. Since observations of real code have demonstrated that branches have a greater than 50% chance of being taken, the best static branch prediction mechanism would be to fetch the next instruction from the branch target address as soon as the branch instruction is detected.

      A better method of predicting the outcome of a branch is by observing its op-code, because some branch instructions are taken more or less frequently than other branch instructions. Using the branch op-code to predict that the branch will or will not be taken results in 75% accuracy. An extension of this technique is to devote a bit of the op-code to the static prediction of branches. This bit is set or cleared by the compiler depending on whether the compiler estimates that the branch is most likely to be taken. This technique provides branch prediction accuracy in the range 74 to 94%.

      Dynamic branch prediction techniques operate at runtime and use the past behavior of the program to predict its future behavior. Suppose the processor maintains a table of branch instructions. This branch table contains information about the likely behavior of each branch. Each time a branch is executed, its outcome (i.e., taken or not taken) is used to update the entry in the table. The processor uses the table to determine whether to take the next instruction from the branch target address (i.e., branch predicted taken) or from the next address in sequence (branch predicted not taken).

      Single-bit branch predictors provide an accuracy of over 80 percent and five-bit predictors provide an accuracy of up to 98 percent. A typical branch prediction algorithm uses the last two outcomes of a branch to predict its future. If the last two outcomes are X, the next branch is assumed to lead to outcome X. 
      If the prediction is wrong, it remains the same the next time the branch is executed (i.e., two failures are needed to modify the prediction). After two consecutive failures, the prediction is inverted and the other outcome assumed. This algorithm responds to trends and is not affected by the occasional single different outcome.

      Problems

      1. What are the characteristics of a CISC processor?
      2. The most frequently executed class of instruction is the data move instruction. Why is this?
      3. The Berkeley RISC has a 32-bit architecture and yet provides only a 13-bit literal. Why is this and does it really matter?
    • 4. What are the advantages and disadvantages of register windowing?
      5. What is pipelining and how does it increase the performance of a computer?
      6. A pipeline is defined by its length (i.e., the number of stages that can operate in parallel). A pipeline can be short or long. What do you think are the relative advantages of long and short pipelines?
      7. What is data dependency in a pipelined system and how can its effects be overcome?
      8. RISC architectures don't permit operations on operands in memory other than load and store operations. Why?
      9. The average number of cycles required by a RISC to execute an instruction is given by Tave = 1 + pb × pt × b, where
         The probability that a given instruction is a branch is pb
         The probability that a branch instruction will be taken is pt
         If a branch is taken, the additional penalty is b cycles
         If a branch is not taken, there is no penalty
         Draw a series of graphs of the average number of cycles per instruction as a function of pb × pt for b = 1, 2, 3, and 4.
      10. What is branch prediction and how can it be used to reduce the so-called branch penalty in a pipelined system?
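      As background for Problem 10, the dynamic prediction algorithm described earlier (the prediction is inverted only after two consecutive failures) can be sketched in Python. This is a minimal model under stated assumptions: it tracks a single branch rather than a full branch table, and the class name, the initial prediction and the loop-like outcome sequence are all hypothetical choices made for the example.

```python
class TwoFailurePredictor:
    """Predictor for one branch that flips only after two consecutive misses."""

    def __init__(self, initial_prediction=True):
        self.prediction = initial_prediction   # True means "predict taken"
        self.consecutive_failures = 0

    def predict(self):
        return self.prediction

    def update(self, taken):
        """Record the actual outcome of the branch."""
        if taken == self.prediction:
            self.consecutive_failures = 0      # a success resets the count
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures == 2:  # second failure in a row
                self.prediction = taken         # invert the prediction
                self.consecutive_failures = 0


def accuracy(outcomes, predictor):
    """Fraction of outcomes that the predictor guessed correctly."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# Hypothetical loop branch: taken nine times, then not taken once, three times over.
outcomes = ([True] * 9 + [False]) * 3
predictor = TwoFailurePredictor(initial_prediction=True)
acc = accuracy(outcomes, predictor)
print(acc)
```

      In this sequence the single not-taken outcome at each loop exit costs one misprediction but does not disturb the prediction for the next pass through the loop, which is the behavior the text describes as not being affected by the occasional single different outcome.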