Published in: Technology

  1. 1. Microprocessors and Microcontrollers. Third Year BE Computers. Pawar Virendra D. Mo. No.: 9423582261. 1/153 MPMC © Pawar Virendra D.
  2. 2. Syllabus
     EC4813: Microprocessors and Microcontrollers
     Prerequisites: Understanding of Microprocessors, Peripheral Chips, Analogue Sensors, Conversion, Interfacing Techniques.
     Aim: This course covers the design of hardware and software code using a modern microcontroller. It emphasizes assembly language programming of the microcontroller, including device drivers, exception and interrupt handling, and interfacing with higher-level languages.
     Objectives:
     1. To exhibit knowledge of the architecture of microcontrollers and apply program control structures to microcontrollers;
     2. To develop the ability to use assembly language to program a microcontroller and demonstrate the capability to program the microcontroller to communicate with external circuitry using parallel ports;
     3. To demonstrate the capability to program the microcontroller to communicate with external circuitry using serial ports and timer ports.
     Unit 1: Introduction to Pentium microprocessor (7 Hrs)
     Pentium Microprocessor: History, Features & Architecture, Pin Description, Functional Description, Real Mode, RISC, Super Scalar, Pipelining, Instruction Pairing, Branch Prediction, Instruction and Data Cache, FPU
     Unit 2: Bus Cycles and Memory Organization (7 Hrs)
     Bus Cycles & Memory Organisation: Init & Configuration, Bus Operations-RST, Mem/IO Organisation, Data Transfer Mechanism, 8/16/32 bit Data Bus I, Programmer's Model, Register Set, Instruction Set, Data Types, Instructions
     Unit 3: Protected Mode (6 Hrs)
     Protected Mode: Intro Segmentation, Supp Registers, Rel Int Desc, Mem Man thru Segmentation, Logical to linear translation, protection by segmentation, Privilege Level protection, related instructions, inter-privilege level transfer of control, paging-support registers, descriptors, linear-physical add trans, TLB, page level protection, virtual memory
     Unit 4: Multitasking, Interrupts, Exceptions and I/O (6 Hrs)
     Multitasking, Interrupts, Exception I/O: Multi Tasking Support Reg, Rel Des, Task Switch, I/O Permission Bit Map, Virtual Mode, Add Gen, Priv Level, Inst & Reg, Entering/Leaving V86 Mode, Interrupt Structure Real/Prot/V86 Mode, I/O Handling, comparison of 3 modes.
     Unit 5: 8051 Micro controller (7 Hrs)
     Family Architecture, Data/Programme Memory, Reg set, Reg Bank, SFR, Ext Data/Mem Programme Mem, Interrupt Structure, Timer Prog, Serial Port Prog, Misc Features, Min System
     Unit 6: PIC Micro-Controller (7 Hrs)
     PIC Micro-Controller: Overview, Features, Pin Out, Capture/Compare/Pulse width modulation Mode, Block Dia, Prog Model, Reset/Clocking, Mem Org, Prog/Data, Flash EPROM, Add Mode/Inst Set, Prog, I/O, Interrupt, Timer, ADC
     Outcomes: Upon completion of the course, the student should be able to:
  3. 3. 1. Describe and use the functional blocks utilized in a basic microcontroller based system.
     2. Describe the programmer's model of the CPU's instruction set and various addressing modes.
     3. Proficiently use the various instruction set and functional groups, when programming.
     4. Integrate structured programming techniques and sub-routines into microcontroller based hardware topologies.
     5. Develop I/O port, ADC hardware, and software interfacing techniques.
     6. Describe the use of sensors, interfacing, and signal conditioning when utilizing the microcontroller in control and monitor applications.
     Text Books:
     1. Antonakos J., "The Pentium Microprocessor", Pearson Education, 2004, 2nd Edition.
     2. Deshmukh A., "Microcontrollers - Theory and Applications", Tata McGraw-Hill, 2004.
     Reference Books:
     1. Mazidi M., Gillispie J., "The 8051 Microcontroller and embedded systems", Pearson Education, 2002, ISBN 81-7808-574-7
     2. Intel Pentium Data Sheets
     3. Ayala K., "The 8051 Microcontroller", Penram International, 1996, ISBN 81-900828-4-1
     4. Intel 8 bit Microcontroller manual
     5. Microchip manual for PIC 16CXX and 16FXX
  4. 4. INTRODUCTION
     16-bit Processors and Segmentation (1978)
     The IA-32 architecture family was preceded by 16-bit processors, the 8086 and 8088. The 8086 has 16-bit registers and a 16-bit external data bus, with 20-bit addressing giving a 1-MByte address space. The 8088 is similar to the 8086 except it has an 8-bit external data bus. The 8086/8088 introduced segmentation to the IA-32 architecture. With segmentation, a 16-bit segment register contains a pointer to a memory segment of up to 64 KBytes. Using four segment registers at a time, 8086/8088 processors are able to address up to 256 KBytes without switching between segments. The 20-bit addresses that can be formed using a segment register and an additional 16-bit pointer provide a total address range of 1 MByte.
     The Intel® 286 Processor (1982)
     The Intel 286 processor introduced protected mode operation into the IA-32 architecture. Protected mode uses the segment register content as selectors or pointers into descriptor tables. Descriptors provide 24-bit base addresses with a physical memory size of up to 16 MBytes, support for virtual memory management on a segment swapping basis, and a number of protection mechanisms. These mechanisms include:
     • Segment limit checking
     • Read-only and execute-only segment options
     • Four privilege levels
     The Intel386™ Processor (1985)
     The Intel386 processor was the first 32-bit processor in the IA-32 architecture family. It introduced 32-bit registers for use both to hold operands and for addressing. The lower half of each 32-bit Intel386 register retains the properties of the 16-bit registers of earlier generations, permitting backward compatibility. The processor also provides a virtual-8086 mode that allows for even greater efficiency when executing programs created for 8086/8088 processors.
     In addition, the Intel386 processor has support for:
     • A 32-bit address bus that supports up to 4 GBytes of physical memory
     • A segmented-memory model and a flat memory model
     • Paging, with a fixed 4-KByte page size providing a method for virtual memory management
     • Support for parallel stages
     The Intel486™ Processor (1989)
     The Intel486™ processor added more parallel execution capability by expanding the Intel386 processor's instruction decode and execution units into five pipelined stages. Each stage operates in parallel with the others on up to five instructions in different stages of execution.
     In addition, the processor added:
     • An 8-KByte on-chip first-level cache that increased the percent of instructions that could execute at the scalar rate of one per clock
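The 20-bit address formation described for the 8086/8088 in this introduction can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `physical_address` is hypothetical, and the shift-by-4 rule (segment value times 16, plus the 16-bit offset) is the standard 8086 behaviour rather than something spelled out in the notes.

```python
SEGMENT_SIZE = 0x10000   # one segment spans up to 64 KBytes
ADDRESS_MASK = 0xFFFFF   # the 8086/8088 drive 20 address lines (1 MByte)


def physical_address(segment: int, offset: int) -> int:
    """Form a 20-bit physical address from a 16-bit segment and a 16-bit offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    # Shift the segment left by 4 bits (multiply by 16), add the offset,
    # and keep only the low 20 bits, since there are only 20 address lines.
    return ((segment << 4) + offset) & ADDRESS_MASK


# Four segment registers, each reaching 64 KBytes, cover 256 KBytes at a time.
reachable_without_reload = 4 * SEGMENT_SIZE

print(hex(physical_address(0x1234, 0x0010)))
print(reachable_without_reload // 1024, "KBytes")
```

Different segment:offset pairs can name the same physical byte, which is why a single 16-bit pointer is enough as long as the segment register already points near the data.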
  5. 5. • An integrated x87 FPU
     • Power saving and system management capabilities
     The Intel® Pentium® Processor (1993)
     The introduction of the Intel Pentium processor added a second execution pipeline to achieve superscalar performance (two pipelines, known as u and v, together can execute two instructions per clock). The on-chip first-level cache doubled, with 8 KBytes devoted to code and another 8 KBytes devoted to data. The data cache uses the MESI protocol to support more efficient write-back cache in addition to the write-through cache previously used by the Intel486 processor. Branch prediction with an on-chip branch table was added to increase performance in looping constructs.
     In addition, the processor added:
     • Extensions to make the virtual-8086 mode more efficient and allow for 4-MByte as well as 4-KByte pages
     • Internal data paths of 128 and 256 bits add speed to internal data transfers
     • Burstable external data bus was increased to 64 bits
     • An APIC to support systems with multiple processors
     • A dual processor mode to support glueless two processor systems
     PROCESSOR FEATURES OVERVIEW
     The Pentium processor supports the features of previous Intel Architecture processors and provides significant enhancements including the following:
     • Superscalar Architecture
     • Dynamic Branch Prediction
     • Pipelined Floating-Point Unit
     • Improved Instruction Execution Time
     • Separate Code and Data Caches
     • Writeback MESI Protocol in the Data Cache
     • 64-Bit Data Bus
     • Bus Cycle Pipelining
     • Address Parity
     • Internal Parity Checking
     • Functional Redundancy Checking and Lock Step operation
     • Execution Tracing
     • Performance Monitoring
     • IEEE 1149.1 Boundary Scan
     • System Management Mode
     • Virtual Mode Extensions
     • Upgradable with a Pentium OverDrive processor
     • Dual processing support
     • Advanced SL Power Management Features
     • Fractional Bus Operation
     • On-Chip Local APIC Device
  6. 6. • Support for the Intel 82498/82493 and 82497/82492 cache chipset products
     • Upgradability with a Pentium OverDrive processor
     • Split line accesses to the code cache
     COMPONENT INTRODUCTION
     The application instruction set of the Pentium processor family includes the complete instruction set of existing Intel Architecture processors to ensure backward compatibility, with extensions to accommodate the additional functionality of the Pentium processor. All application software written for the Intel386™ and Intel486™ microprocessors will run on the Pentium processor without modification. The on-chip memory management unit (MMU) is completely compatible with the Intel386 and Intel486 CPUs.
     The two instruction pipelines and the floating-point unit on the Pentium processor are capable of independent operation. Each pipeline issues frequently used instructions in a single clock. Together, the dual pipes can issue two integer instructions in one clock, or one floating-point instruction (under certain circumstances, 2 floating-point instructions)
  7. 7. in one clock. Branch prediction is implemented in the Pentium processor. To support this, the Pentium processor implements two prefetch buffers, one to prefetch code in a linear fashion, and one that prefetches code according to the Branch Target Buffer (BTB) so the needed code is almost always prefetched before it is needed for execution.
     The Pentium processor includes separate code and data caches integrated on chip to meet its performance goals. The caches on the Pentium processor are each 8 KBytes in size and 2-way set-associative. Each cache has a dedicated Translation Lookaside Buffer (TLB) to translate linear addresses to physical addresses. The Pentium processor data cache is configurable to be writeback or writethrough on a line-by-line basis and follows the MESI protocol. The data cache tags are triple ported to support two data transfers and an inquire cycle in the same clock. The code cache is an inherently write protected cache. The code cache tags of the Pentium processor are also triple ported to support snooping and split-line accesses.
     The Pentium processor has a 64-bit data bus. Burst read and burst writeback cycles are supported by the Pentium processor. In addition, bus cycle pipelining has been added to allow two bus cycles to be in progress simultaneously. The Pentium processor Memory Management Unit contains optional extensions to the architecture which allow 4-MByte page sizes.
     The Pentium processor has added significant data integrity and error detection capability. Data parity checking is still supported on a byte-by-byte basis. Address parity checking and internal parity checking features have been added along with a new exception, the machine check exception.
     The Pentium processor has implemented functional redundancy checking to provide maximum error detection of the processor and the interface to the processor. When functional redundancy checking is used, a second processor, the “checker”, is used to execute in lock step with the “master” processor. The checker samples the master's outputs and compares those values with the values it computes internally, and asserts an error signal if a mismatch occurs. The Pentium processor with MMX technology does not support functional redundancy checking.
     As more and more functions are integrated on chip, the complexity of board level testing is increased. To address this, the Pentium processor has increased test and debug capability by implementing IEEE Boundary Scan (Standard 1149.1). System management mode has been implemented along with some extensions to the SMM architecture.
     Enhancements to the Virtual 8086 mode have been made to increase performance by reducing the number of times it is necessary to trap to a Virtual 8086 monitor.
     [Figure: Pentium processor block diagram]
     The block diagram shows the processor's functional units, including the two instruction pipelines, the “u” pipe and the “v” pipe. The u-pipe can execute all integer and floating-point instructions. The v-pipe can execute simple integer instructions and the FXCH floating-point instruction.
  8. 8. The separate code and data caches are shown. The data cache has two ports, one for each of the two pipes (the tags are triple ported to allow simultaneous inquire cycles). The data cache has a dedicated TLB to translate linear addresses to the physical addresses used by the data cache.
     The code cache, branch target buffer and prefetch buffers are responsible for getting raw instructions into the execution units of the Pentium processor. Instructions are fetched from the code cache or from the external bus. Branch addresses are remembered by the branch target buffer. The code cache TLB translates linear addresses to physical addresses used by the code cache.
     The decode unit contains two parallel decoders which decode and issue up to the next two sequential instructions into the execution pipeline. The control ROM contains the microcode which controls the sequence of operations performed by the processor. The control unit has direct control over both pipelines.
     The Pentium processor contains a pipelined floating-point unit that provides a significant floating-point performance advantage over previous generations of Intel Architecture-based processors.
     The Pentium processor includes features to support multi-processor systems, namely an on-chip Advanced Programmable Interrupt Controller (APIC). This APIC implementation supports multiprocessor interrupt management (with symmetric interrupt distribution across all processors), multiple I/O subsystem support, 8259A compatibility, and inter-processor interrupt support.
     The dual processor configuration allows two Pentium processors to share a single L2 cache for a low-cost symmetric multi-processor system. The two processors appear to the system as a single Pentium processor. Multiprocessor operating systems properly schedule computing tasks between the two processors. This scheduling of tasks is transparent to software applications and the end-user. Logic built into the processors supports a “glueless” interface for easy system design. Through a private bus, the two Pentium processors arbitrate for the external bus and maintain cache coherency. The Pentium processor can also be used in a conventional multi-processor system in which one L2 cache is dedicated to each processor.
     The Pentium processor is produced on Intel's advanced silicon technology. The Pentium processor also includes SL enhanced power management features. When the clock to the Pentium processor is stopped, power dissipation is virtually eliminated. The low VCC operating voltages and SL enhanced power management features make the Pentium processor a good choice for energy-efficient desktop designs.
  9. 9. PIN DESCRIPTION
     Symbol | Type | Name and Function
     A31-A3 | I/O | As outputs, the address lines of the processor along with the byte enables define the physical area of memory or I/O accessed. The external system drives the inquire address to the processor on A31-A5.
     D63-D0 | I/O | These are the 64 data lines for the processor. Lines D7-D0 define the least significant byte of the data bus; lines D63-D56 define the most significant byte of the data bus. When the CPU is driving the data lines, they are driven during the T2, T12, or T2P clocks for that cycle. During reads, the CPU samples the data bus when BRDY# is returned.
     ADS# | O | The address status indicates that a new valid bus cycle is currently being driven by the Pentium processor.
     BE7#-BE5#, BE4#-BE0# | O (BE7#-BE5#), I/O (BE4#-BE0#) | The byte enable pins are used to determine which bytes must be written to external memory, or which bytes were requested by the CPU for the current cycle. The byte enables are driven in the same clock as the address lines (A31-A3).
     BOFF# | I | The backoff input is used to abort all outstanding bus cycles that have not yet completed. In response to BOFF#, the Pentium processor will float all pins normally floated during bus hold in the next clock. The processor remains in bus hold until BOFF# is negated, at which time the Pentium processor restarts the aborted bus cycle(s) in their entirety.
     BRDY# | I | The burst ready input indicates that the external system has presented valid data on the data pins in response to a read or that the external system has accepted the Pentium processor data in response to a write request. This signal is sampled in the T2, T12 and T2P bus states.
     CACHE# | O | For Pentium processor initiated cycles the cache pin indicates internal cacheability of the cycle (if a read), and indicates a burst write back cycle (if a write). If this pin is driven inactive during a read cycle, the Pentium processor will not cache the returned data, regardless of the state of the KEN# pin. This pin is also used to determine the cycle length (number of transfers in the cycle).
     CPUTYP | I | CPU type distinguishes the Primary processor from the Dual processor. In a single processor environment, or when the Pentium processor is acting as the Primary processor in a dual processing system, CPUTYP should be strapped to VSS. The Dual processor should have CPUTYP strapped to VCC. For the Pentium OverDrive processor, CPUTYP will be used to determine whether the bootup handshake protocol will be used (in a dual socket system) or not (in a single socket system).
     FLUSH# | I | When asserted, the cache flush input forces the Pentium processor to write back all modified lines in the data cache
  10. 10. and invalidate its internal caches. A Flush Acknowledge special cycle will be generated by the Pentium processor indicating completion of the write back and invalidation. If FLUSH# is sampled low when RESET transitions from high to low, tristate test mode is entered. If two Pentium processors are operating in dual processing mode and FLUSH# is asserted, the Dual processor will perform a flush first (without a flush acknowledge cycle), then the Primary processor will perform a flush followed by a flush acknowledge cycle. NOTE: If the FLUSH# signal is asserted in dual processing mode, it must be deasserted at least one clock prior to BRDY# of the FLUSH Acknowledge cycle to avoid DP arbitration problems.
      FRCMC# | I | The functional redundancy checking master/checker mode input is used to determine whether the Pentium processor is configured in master mode or checker mode. When configured as a master, the Pentium processor drives its output pins as required by the bus protocol. When configured as a checker, the Pentium processor tristates all outputs (except IERR# and TDO) and samples the output pins. The configuration as a master/checker is set after RESET and may not be changed other than by a subsequent RESET.
      HOLD | I | In response to the bus hold request, the Pentium processor will float most of its output and input/output pins and assert HLDA after completing all outstanding bus cycles. The Pentium processor will maintain its bus in this state until HOLD is de-asserted. HOLD is not recognized during LOCK cycles. The Pentium processor will recognize HOLD during reset.
      HLDA | O | The bus hold acknowledge pin goes active in response to a hold request driven to the processor on the HOLD pin. It indicates that the Pentium processor has floated most of the output pins and relinquished the bus to another local bus master. When leaving bus hold, HLDA will be driven inactive and the Pentium processor will resume driving the bus. If the Pentium processor has a bus cycle pending, it will be driven in the same clock that HLDA is de-asserted.
      INIT | I | The Pentium processor initialization input pin forces the Pentium processor to begin execution in a known state. The processor state after INIT is the same as the state after RESET except that the internal caches, write buffers, and floating point registers retain the values they had prior to INIT. INIT may NOT be used in lieu of RESET after power-up. If INIT is sampled high when RESET transitions from high to low, the Pentium processor will perform built-in self test prior to the start of program execution.
  11. 11. INV | I | The invalidation input determines the final cache line state (S or I) in case of an inquire cycle hit. It is sampled together with the address for the inquire cycle in the clock EADS# is sampled active.
      KEN# | I | The cache enable pin is used to determine whether the current cycle is cacheable or not and is consequently used to determine cycle length. When the Pentium processor generates a cycle that can be cached (CACHE# asserted) and KEN# is active, the cycle will be transformed into a burst line fill cycle.
      LOCK# | O | The bus lock pin indicates that the current bus cycle is locked. The Pentium processor will not allow a bus hold when LOCK# is asserted (but AHOLD and BOFF# are allowed). LOCK# goes active in the first clock of the first locked bus cycle and goes inactive after the BRDY# is returned for the last locked bus cycle. LOCK# is guaranteed to be de-asserted for at least one clock between back-to-back locked cycles.
      NA# | I | An active next address input indicates that the external memory system is ready to accept a new bus cycle although all data transfers for the current cycle have not yet completed. The Pentium processor will issue ADS# for a pending cycle two clocks after NA# is asserted. The Pentium processor supports up to 2 outstanding bus cycles.
      RESET | I | RESET forces the Pentium processor to begin execution at a known state. All the Pentium processor internal caches will be invalidated upon the RESET. Modified lines in the data cache are not written back. FLUSH#, FRCMC# and INIT are sampled when RESET transitions from high to low to determine if tristate test mode or checker mode will be entered, or if BIST will be run.
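The pin descriptions say that A31-A3 together with the byte enables BE7#-BE0# identify which bytes of the 64-bit data bus take part in a cycle. The sketch below illustrates that idea in Python. It rests on stated assumptions: the function name is hypothetical, BEn# is taken to select byte lane n (so BE0# corresponds to D7-D0), and an access that would cross an 8-byte boundary is simply rejected rather than split into two bus cycles.

```python
from typing import Tuple


def bus_address_and_byte_enables(addr: int, size: int) -> Tuple[int, int]:
    """Return (value driven on A31-A3, levels on BE7#-BE0#) for one access.

    The second element is an 8-bit value whose bit n is the level of BEn#.
    The pins are active low, so a 0 bit means that byte lane is selected.
    """
    lane = addr & 0x7  # first byte lane inside the aligned 8-byte quadword
    if size < 1 or lane + size > 8:
        raise ValueError("access must fit inside one aligned 8-byte quadword")
    selected = ((1 << size) - 1) << lane  # 1 bits mark the selected lanes
    be_pins = ~selected & 0xFF            # invert because BEn# is active low
    return addr >> 3, be_pins             # A2-A0 are not driven as address pins


quad_address, be = bus_address_and_byte_enables(0x1004, 4)
print(hex(quad_address), format(be, "08b"))
```

A 4-byte read at 0x1004 therefore uses the upper half of the bus (lanes 4 to 7), while the same read at 0x1000 would use the lower half with the same A31-A3 value.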
  12. 12. REAL MODE
      RISC
      A Complex Instruction Set Computer (CISC) provides a large and powerful range of instructions, which is less flexible to implement. For example, the 8086 microprocessor family has these instructions:
      JA Jump if Above
      JAE Jump if Above or Equal
      JB Jump if Below
      By contrast, the Reduced Instruction Set Computer (RISC) concept is to identify the sub-components and use those. As these are much simpler, they can be implemented directly in silicon, so will run at the maximum possible speed. Nothing is translated.
      Most modern CISC processors, such as the Pentium, use a fast RISC core with an interpreter sitting between the core and the instruction. So when you are running Windows 95 on a PC, it is not that much different from trying to get Windows 95 running on a software PC emulator. Just imagine the power hidden inside the Pentium...
      This is not to say that CISC processors cannot have a large number of registers; some do. However, a typical RISC processor requires more registers to give it additional flexibility. Gone are the days when you had two general purpose registers and an accumulator.
      One thing RISC does offer, though, is register independence.
      The 8086 offers you fourteen registers, but with caveats:
      The first four (A, B, C, and D) are Data registers (a.k.a. scratch-pad registers). They are 16-bit and can be accessed as two 8-bit registers, thus register A is really AH (A, high-order byte) and AL (A, low-order byte). These can be used as general purpose registers, but they can also have dedicated functions: Accumulator, Base, Count, and Data.
      The advantages of RISC over CISC today are:
      • RISC processors are much simpler to build, which in turn brings the following advantages:
        o easier to build, i.e. you can use already existing production facilities
        o much less expensive, just compare the price of an XScale with that of a Pentium III at 1 GHz...
        o less power consumption, which again gives two advantages:
          much longer use of battery driven devices
          no need for cooling of the device, which again gives two advantages:
  13. 13.     smaller design of the whole device
              no noise
      • RISC processors are much simpler to program, which helps not only the assembler programmer but the compiler designer, too. You will hardly find any compiler which uses all the functions of a Pentium III optimally.
      SUPER SCALAR
      A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.
      While a superscalar CPU is typically also pipelined, pipelining and superscalar architecture are considered different performance enhancement techniques.
      The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU core):
      • Instructions are issued from a sequential instruction stream
      • CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time)
      • The CPU accepts multiple instructions per clock cycle
      The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy is the difference between scalar and vector arithmetic. A superscalar processor is sort of a mixture of the two. Each instruction processes one data item, but there are multiple redundant functional units within each CPU, thus multiple instructions can be processing separate data items concurrently.
      Superscalar CPU design emphasizes improving the instruction dispatcher accuracy, and allowing it to keep the multiple functional units in use at all times. This has become increasingly important as the number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU, a modern design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will suffer.
  14. 14. A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that, but with different methods.
      In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to redundant functional units contained inside a single CPU. Therefore a superscalar processor can be envisioned having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread.
      Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of the other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; b = e + f might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units.
      When the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to check dependencies at run time and at the CPU's clock rate. This cost includes additional logic gates required to implement the checks,
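The two instruction pairs discussed above (a = b + c; d = e + f versus a = b + c; b = e + f) can be run through a small dependency check. The Python sketch below is only an illustration of the idea, not a model of any real dispatcher: the `Instr` type and `can_issue_together` are hypothetical names, and the check is deliberately conservative in that it refuses to pair on any read-after-write, write-after-read, or write-after-write overlap.

```python
from typing import FrozenSet, NamedTuple


class Instr(NamedTuple):
    """One three-operand instruction: dest = srcs[0] op srcs[1]."""
    dest: str
    srcs: FrozenSet[str]


def can_issue_together(first: Instr, second: Instr) -> bool:
    """Conservatively decide whether two sequential instructions may run in parallel."""
    raw = first.dest in second.srcs   # second reads what first writes
    war = second.dest in first.srcs   # second overwrites an input of first
    waw = first.dest == second.dest   # both write the same location
    return not (raw or war or waw)


independent = (Instr("a", frozenset({"b", "c"})), Instr("d", frozenset({"e", "f"})))
dependent = (Instr("a", frozenset({"b", "c"})), Instr("b", frozenset({"e", "f"})))

print(can_issue_together(*independent))
print(can_issue_together(*dependent))
```

With n instructions considered per clock, every pair has to be compared, so the number of comparisons grows roughly with n squared, which is one way to see why the cost of dependency checking rises so quickly.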
  15. 15. PIPELINE AND INSTRUCTION FLOW
      The integer instructions traverse a five stage pipeline in the Pentium processor. The pipeline stages are as follows:
      PF  Prefetch
      D1  Instruction Decode
      D2  Address Generate
      EX  Execute - ALU and Cache Access
      WB  Writeback
      The Pentium processor is a superscalar machine, built around two general purpose integer pipelines and a pipelined floating-point unit capable of executing two instructions in parallel. Both pipelines operate in parallel allowing integer instructions to execute in a single clock in each pipeline. The figure depicts instruction flow in the Pentium processor.
      The pipelines in the Pentium processor are called the “u” and “v” pipes and the process of issuing two instructions in parallel is termed “pairing.” The u-pipe can execute any instruction in the Intel architecture, while the v-pipe can execute “simple” instructions as defined in the “Instruction Pairing Rules” section of this chapter. When instructions are paired, the instruction issued to the v-pipe is always the next sequential instruction after the one issued to the u-pipe.
      [Figure: Pentium® Processor Pipeline Execution]
      The Pentium processor pipeline has been optimized to achieve higher throughput compared to previous generations of Intel Architecture processors.
      The first stage of the pipeline is the Prefetch (PF) stage in which instructions are prefetched from the on-chip instruction cache or memory. Because the Pentium processor has separate caches for instructions and data, prefetches do not conflict with data references for access to the cache. If the requested line is not in the code cache, a memory reference is made. In the PF stage of the Pentium processor, two independent pairs of line-size (32-byte) prefetch buffers operate in conjunction with the Branch Target Buffer. This allows one prefetch buffer to prefetch instructions sequentially, while the other prefetches according to the branch target buffer predictions. The pipeline stage after
  16. 16. the PF stage in the Pentium processor is Decode 1 (D1) in which two parallel decodersattempt to decode and issue the next two sequential instructions. The decoders determinewhether one or two instructions can be issued contingent upon the “Instruction PairingRules.” The Pentium processor requires an extra D1 clock to decode instructionprefixes. Prefixes are issued to the u-pipe at the rate of one per clock without pairing.After all prefixes have been issued, the base instruction will then be issued and pairedaccording to the pairing rules.The D1 stage is followed by Decode2 (D2) in which addresses of memory residentoperands are calculated. In instructions containing both a displacement and an immediate,or instructions containing a base and index addressing mode , The Pentium processorremoves both of these restrictions and is able to issue instructions in these categories in asingle clock.The Pentium processor uses the Execute (EX) stage of the pipeline for both ALUoperations and for data cache access; therefore those instructions specifying both an ALUoperation and a data cache access will require more than one clock in this stage. In EX allu-pipe instructions and all v-pipe instructions except conditional branches are verified forcorrect branch prediction. Microcode is designed to utilize both pipelines and thus thoseinstructions requiring microcode execute faster.The final stage is Writeback (WB) where instructions are enabled to modify processorstate and complete execution. In this stage, v-pipe conditional branches are verified forcorrect branch prediction. During their progression through the pipeline, instructions maybe stalled due to certain conditions. Both the u-pipe and v-pipe instructions enter andleave the D1 and D2 stages in unison. When an instruction in one pipe is stalled, thenthe instruction in the other pipe is also stalled at the same pipeline stage. Thus both the u-pipe and the v-pipe instructions enter the EX stage in unison. 
Once in EX, if the u-pipe instruction is stalled, then the v-pipe instruction (if any) is also stalled. If the v-pipe instruction is stalled, then the instruction paired with it in the u-pipe is not allowed to advance. No successive instructions are allowed to enter the EX stage of either pipeline until the instructions in both pipelines have advanced to WB.
INSTRUCTION PREFETCH
In the Pentium processor PF stage, two independent pairs of line-size (32-byte) prefetch buffers operate in conjunction with the branch target buffer. Only one prefetch buffer actively requests prefetches at any given time. Prefetches are requested sequentially until a branch instruction is fetched. When a branch instruction is fetched, the branch target buffer (BTB) predicts whether the branch will be taken or not. If the branch is predicted not taken, prefetch requests continue linearly. On a predicted taken branch, the other prefetch buffer is enabled and begins to prefetch as though the branch was taken. If a branch is discovered to be mispredicted, the instruction pipelines are flushed and prefetching activity starts over.
Integer Instruction Pairing Rules
The Pentium processor can issue one or two instructions every clock. In order to issue two instructions simultaneously, they must satisfy the following conditions:
• Both instructions in the pair must be “simple” as defined below.
  17. 17. Simple instructions are entirely hardwired; they do not require any microcode control and, in general, execute in one clock. The exceptions are the ALU mem, reg and ALU reg, mem instructions.
• There must be no read-after-write or write-after-write register dependencies between them.
• Neither instruction may contain both a displacement and an immediate.
• Instructions with prefixes can only occur in the u-pipe.
• Instruction prefixes are treated as separate 1-byte instructions. Sequencing hardware is used to allow them to function as simple instructions.
The following integer instructions are considered simple and may be paired:
1. mov reg, reg/mem/imm
2. mov mem, reg/imm
3. alu reg, reg/mem/imm
4. alu mem, reg/imm
5. inc reg/mem
6. dec reg/mem
7. push reg/mem
8. pop reg
9. lea reg, mem
10. jmp/call/jcc near
11. nop
12. test reg, reg/mem
13. test acc, imm
In addition, conditional and unconditional branches may be paired only if they occur as the second instruction in the pair. They may not be paired with the next sequential instruction. Also, SHIFT/ROT by 1 and SHIFT by imm may pair as the first instruction in a pair. The register dependencies that prohibit instruction pairing include implicit dependencies via registers or flags not explicitly encoded in the instruction. For example, an ALU instruction in the u-pipe (which sets the flags) may not be paired with an ADC or an SBB instruction in the v-pipe. There are two exceptions to this rule. The first is the commonly occurring sequence of compare and branch, which may be paired. The second exception is pairs of pushes or pops. Although these instructions have an implicit dependency on the stack pointer, special hardware is included to allow these common operations to proceed in parallel. Although in general two paired instructions may proceed in parallel independently, there is an exception for paired “read-modify-write” instructions. Read-modify-write instructions are ALU operations with an operand in memory.
When two of these instructions are paired, there is a sequencing delay of two clocks in addition to the three clocks required to execute the individual instructions. Although instructions may execute in parallel, their behavior as seen by the programmer is exactly the same as if they were executed sequentially.
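The pairing conditions listed above can be sketched as a small checker. This is only an illustrative model, not the processor's actual issue logic: the `Instr` record, its field names, and the `can_pair` function are hypothetical, and the register read/write sets are filled in by hand.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    """Hypothetical, hand-annotated description of one instruction."""
    text: str
    reads: frozenset = frozenset()    # explicit/implicit registers read
    writes: frozenset = frozenset()   # explicit/implicit registers written
    simple: bool = True               # on the "simple" list
    is_branch: bool = False
    is_stack_op: bool = False         # push or pop
    sets_flags: bool = False
    reads_flags: bool = False
    has_prefix: bool = False
    disp_and_imm: bool = False        # has both displacement and immediate

def can_pair(u, v):
    """Return True if u (u-pipe) and v (v-pipe) may issue together."""
    if not (u.simple and v.simple):
        return False
    if u.is_branch:                   # branches pair only as the second instruction
        return False
    if v.has_prefix:                  # prefixed instructions only in the u-pipe
        return False
    if u.disp_and_imm or v.disp_and_imm:
        return False
    # Push/pop pairs: the implicit stack-pointer dependency is handled in hardware.
    ignore = {"esp"} if (u.is_stack_op and v.is_stack_op) else set()
    raw = (u.writes & v.reads) - ignore     # read-after-write
    waw = (u.writes & v.writes) - ignore    # write-after-write
    if raw or waw:
        return False
    # Implicit flag dependency (e.g. ALU then ADC), except compare-and-branch.
    if u.sets_flags and v.reads_flags and not v.is_branch:
        return False
    return True

mov = Instr("mov byte ptr flags[edx], al", reads=frozenset({"edx", "al"}))
add = Instr("add edx, ecx", reads=frozenset({"edx", "ecx"}),
            writes=frozenset({"edx"}), sets_flags=True)
cmp_ = Instr("cmp edx, SIZE", reads=frozenset({"edx"}), sets_flags=True)
jle = Instr("jle inner_loop", is_branch=True, reads_flags=True)
adc = Instr("adc eax, ebx", reads=frozenset({"eax", "ebx"}),
            writes=frozenset({"eax"}), sets_flags=True, reads_flags=True)
push_a = Instr("push eax", reads=frozenset({"eax", "esp"}),
               writes=frozenset({"esp"}), is_stack_op=True)
push_b = Instr("push ebx", reads=frozenset({"ebx", "esp"}),
               writes=frozenset({"esp"}), is_stack_op=True)
```

With these annotations, `can_pair(mov, add)` and `can_pair(cmp_, jle)` hold, while `can_pair(add, adc)` fails on the flag dependency. The sketch does not model the extra clocks for paired read-modify-write instructions.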
  18. 18. BRANCH PREDICTION
Branch Target Buffer (BTB)
The Pentium processor uses a Branch Target Buffer (BTB) to predict the outcome of branch instructions, which minimizes pipeline stalls due to prefetch delays.
The Pentium processor accesses the BTB with the address of the instruction in the D1 stage. The BTB contains a branch prediction state machine with four states: (1) strongly not taken, (2) weakly not taken, (3) weakly taken, and (4) strongly taken. In the event of a correct prediction, a branch executes without pipeline stalls or flushes. Branches which miss the BTB are assumed to be not taken. Conditional and unconditional near branches and near calls execute in 1 clock and may be executed in parallel with other integer instructions. A mispredicted branch (whether a BTB hit or miss), or a correctly predicted branch with the wrong target address, causes the pipelines to be flushed and the correct target to be fetched. Incorrectly predicted unconditional branches incur an additional three-clock delay, incorrectly predicted conditional branches in the u-pipe incur an additional three-clock delay, and incorrectly predicted conditional branches in the v-pipe incur an additional four-clock delay.
[Figure: branch prediction state diagram. Legend: H = History, P = Prediction, T = Taken, NT = Not Taken. States: H: 11 (P: T), H: 10 (P: T), H: 01 (P: T), H: 00 (P: NT); transitions are labelled T or NT.]
The benefits of branch prediction are illustrated in the following example. Consider the following loop from a benchmark program for computing prime numbers:
    for (k = i + prime; k <= SIZE; k += prime)
        flags[k] = FALSE;
A popular compiler generates the following assembly code (prime is allocated to ecx, k is allocated to edx, and al contains the value FALSE):
    inner_loop:
        mov byte ptr flags[edx], al
        add edx, ecx
        cmp edx, SIZE
        jle inner_loop
Each iteration of this loop executes in 6 clocks on the Intel486 CPU. On the Pentium processor, the mov is paired with the add, and the cmp with the jle. With branch prediction, each loop iteration executes in 2 clocks.
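The four-state history machine can be sketched as a 2-bit counter. This is a simplified model under stated assumptions: the transitions are assumed to be a plain saturating increment on taken and decrement on not taken, the prediction for each history value follows the figure legend (only H = 00 predicts not taken), and the function names are hypothetical. How the real BTB allocates and initializes entries is not modelled.

```python
def predict_taken(history):
    """Prediction per the figure legend: only history 00 predicts not taken."""
    return history != 0

def update(history, taken):
    """Assumed saturating update: +1 when taken, -1 when not taken."""
    return min(history + 1, 3) if taken else max(history - 1, 0)

def count_mispredictions(outcomes, history=0):
    """Replay a sequence of branch outcomes (True = taken) and count misses."""
    misses = 0
    for taken in outcomes:
        if predict_taken(history) != taken:
            misses += 1
        history = update(history, taken)
    return misses

# A loop-closing branch such as jle inner_loop: taken nine times, then falls through.
loop_branch = [True] * 9 + [False]
loop_misses = count_mispredictions(loop_branch)

# A branch that alternates taken / not taken defeats this simple model.
alternating_misses = count_mispredictions([True, False] * 4)
```

In this model the loop branch is mispredicted only on its first and last iterations, which is why long-running loops benefit most from prediction.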
  19. 19. CACHE
ON-CHIP CACHES
The Pentium processor implements two internal caches for a total integrated cache size of 16 Kbytes: an 8 Kbyte data cache and a separate 8 Kbyte code cache. These caches are transparent to application software to maintain compatibility with previous generations of Intel Architecture processors. The data cache fully supports the MESI (modified/exclusive/shared/invalid) writeback cache consistency protocol. The code cache is inherently write protected to prevent code from being inadvertently corrupted, and as a consequence supports a subset of the MESI protocol, the S (shared) and I (invalid) states. The caches have been designed for maximum flexibility and performance. The data cache is configurable as writeback or writethrough on a line-by-line basis. Memory areas can be defined as non-cacheable by software and external hardware. Cache writeback and invalidations can be initiated by hardware or software. Protocols for cache consistency and line replacement are implemented in hardware, easing system design. On the Pentium processor, each of the caches is 8 Kbytes in size and each is organized as a 2-way set associative cache. There are 128 sets in each cache, each set containing 2 lines (each line has its own tag address). Each cache line is 32 bytes wide. In the Pentium processor, replacement in both the data and instruction caches is handled by the LRU mechanism, which requires one bit per set in each of the caches.
Cache Structure
The instruction and data caches can be accessed simultaneously. The instruction cache can provide up to 32 bytes of raw opcodes and the data cache can provide data for two data references, all in the same clock. This capability is implemented partially through the tag structure. The tags in the data cache are triple ported. One of the ports is dedicated to snooping while the other two are used to look up two independent addresses corresponding to data references from each of the pipelines. The instruction cache tags of the Pentium processor are also triple ported.
Again, one port is dedicated to support snooping and the other two ports facilitate split line accesses (simultaneously accessing the upper half of one line and the lower half of the next line). Each of the caches is parity protected. The operating modes of the caches are controlled by the CD (cache disable) and NW (not writethrough) bits in CR0.
TLB (Translation Lookaside Buffers)
Each of the caches is accessed with physical addresses, and each cache has its own TLB (translation lookaside buffer) to translate linear addresses to physical addresses. The TLBs associated with the instruction cache are single ported, whereas the data cache TLBs are fully dual ported to be able to translate two independent linear addresses for two data references simultaneously.
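The cache organization described above (128 sets, 2 lines per set, 32-byte lines, one LRU bit per set) can be sketched as follows. This is a minimal hit/miss model only; the class and attribute names are hypothetical, and it ignores data storage, MESI states, parity and multi-porting.

```python
LINE_BYTES = 32   # bytes per cache line
NUM_SETS = 128    # sets per cache
WAYS = 2          # lines per set

class TwoWayCache:
    """Hit/miss model of a 2-way set associative cache with 1-bit LRU per set."""

    def __init__(self):
        # One tag per way in each set; None means the way is empty.
        self.tags = [[None] * WAYS for _ in range(NUM_SETS)]
        # One LRU bit per set: the way to replace on the next miss.
        self.lru = [0] * NUM_SETS

    def access(self, physical_address):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line_number = physical_address // LINE_BYTES
        index = line_number % NUM_SETS
        tag = line_number // NUM_SETS
        ways = self.tags[index]
        if tag in ways:
            way = ways.index(tag)
            hit = True
        else:
            way = self.lru[index]      # replace the least recently used way
            ways[way] = tag
            hit = False
        self.lru[index] = 1 - way      # the other way is now least recently used
        return hit

cache = TwoWayCache()
trace = [0x0000, 0x001F, 0x1000, 0x0000, 0x2000, 0x0000, 0x1000]
results = [cache.access(a) for a in trace]
```

Addresses 0x0000, 0x1000 and 0x2000 all map to set 0, so the third distinct line evicts whichever of the other two was used least recently.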
  20. 20. The goal of an effective memory system is that the effective access time the processor sees is very close to the access time of the cache. Most accesses that the processor makes to memory are satisfied within this level. The achievement of this goal depends on many factors: the architecture of the processor, the behavioral properties of the programs being executed, and the size and organization of the cache. Caches work on the basis of the locality of program behavior. There are three principles involved:
1. Spatial Locality - Given an access to a particular location in memory, there is a high probability that other accesses will be made to either that or neighboring locations within the lifetime of the program.
2. Temporal Locality - This is complementary to spatial locality. Given a sequence of references to n locations, there is a high probability that references following this sequence will be made into the sequence. Elements of the sequence will again be referenced during the lifetime of the program.
3. Sequentiality - Given that a reference has been made to a particular location s, it is likely that within the next several references a reference to location s + 1 will be made. Sequentiality is a restricted type of spatial locality and can be regarded as a subset of it.
Some common terms
Processor references that are found in the cache are called cache hits. References not found in the cache are called cache misses. On a cache miss, the cache control mechanism must fetch the missing data from memory and place it in the cache. Usually the cache fetches a block of neighboring locations, called the line, from memory. The physical word is the basic unit of access in the memory.
The processor-cache interface can be characterized by a number of parameters. Those that directly affect processor performance include:
1. Access time for a reference found in the cache (a hit) - property of the cache size and organization.
2. Access time for a reference not found in the cache (a miss) - property of the memory organization.
3. Time to initially compute a real address given a virtual address (not-in-TLB time) - property of the address translation facility, which, though strictly speaking not part of the cache, resembles the cache in most aspects and is discussed in this chapter.
Data Cache Consistency Protocol (MESI Protocol)
The Pentium processor Cache Consistency Protocol is a set of rules by which states are
  21. 21. assigned to cached entries (lines). The rules apply for memory read/write cycles only. I/O and special cycles are not run through the data cache. Every line in the Pentium processor data cache is assigned a state dependent on both Pentium processor generated activities and activities generated by other bus masters (snooping). The Pentium processor Data Cache Protocol consists of four states that define whether a line is valid (HIT/MISS), if it is available in other caches, and if it has been MODIFIED. The four states are the M (Modified), E (Exclusive), S (Shared) and I (Invalid) states, and the protocol is referred to as the MESI protocol. A definition of the states is given below:
M - Modified: An M-state line is available in ONLY one cache and it is also MODIFIED (different from main memory). An M-state line can be accessed (read/written to) without sending a cycle out on the bus.
E - Exclusive: An E-state line is also available in ONLY one cache in the system, but the line is not MODIFIED (i.e., it is the same as main memory). An E-state line can be accessed (read/written to) without generating a bus cycle. A write to an E-state line will cause the line to become MODIFIED.
S - Shared: This state indicates that the line is potentially shared with other caches (i.e., the same line may exist in more than one cache). A read to an S-state line will not generate bus activity, but a write to a SHARED line will generate a write-through cycle on the bus. The write-through cycle may invalidate this line in other caches. A write to an S-state line will update the cache.
I - Invalid: This state indicates that the line is not available in the cache. A read to this line will be a MISS and may cause the Pentium processor to execute a LINE FILL (fetch the whole line into the cache from main memory).
A write to an INVALID line will cause the Pentium processor to execute a write-through cycle on the bus.
Inquire Cycles (Snooping)
The purpose of inquire cycles is to check whether the address being presented is contained within the caches in the Pentium processor.
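The state definitions above can be summarised as a lookup table for processor-initiated reads and writes. This is a rough sketch, not the full protocol: the function name and bus-cycle labels are hypothetical, snoop-induced transitions are omitted, and three entries are placeholder assumptions because the text does not state the resulting state (the line is assumed to stay S after a write to S, to land in S after a line fill, and to stay I after a write to I).

```python
# (current state, operation) -> (next state, bus cycle or None)
MESI_TABLE = {
    ("M", "read"):  ("M", None),             # no bus cycle
    ("M", "write"): ("M", None),             # no bus cycle
    ("E", "read"):  ("E", None),             # no bus cycle
    ("E", "write"): ("M", None),             # line becomes MODIFIED
    ("S", "read"):  ("S", None),             # no bus activity
    ("S", "write"): ("S", "write-through"),  # assumption: line stays S
    ("I", "read"):  ("S", "line fill"),      # assumption: fill lands in S
    ("I", "write"): ("I", "write-through"),  # assumption: no line is allocated
}

def processor_access(state, operation):
    """Return (next_state, bus_cycle) for a processor read or write."""
    return MESI_TABLE[(state, operation)]

def run(state, operations):
    """Apply a sequence of operations and collect the bus cycles generated."""
    bus_cycles = []
    for operation in operations:
        state, cycle = processor_access(state, operation)
        if cycle is not None:
            bus_cycles.append(cycle)
    return state, bus_cycles

final_state, cycles = run("E", ["read", "write", "write", "read"])
```

Starting from an E-state line, the read/write sequence above ends in M without any bus cycle, which is the benefit of the writeback states over plain write-through.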
  22. 22. Cache Organization
Within the cache, there are three basic types of organization:
1. Direct Mapped
2. Fully Associative
3. Set Associative
In fully associative mapping, when a request is made to the cache, the requested address is compared against all entries in a directory. If the requested address is found (a directory hit), the corresponding location in the cache is fetched and returned to the processor; otherwise, a miss occurs.
  23. 23. [Figure: Fully Associative Cache]
In a direct mapped cache, lower-order line address bits are used to access the directory. Since multiple line addresses map into the same location in the cache directory, the upper line address bits (tag bits) must be compared with the directory address to ensure a hit. If the comparison fails, the result is a cache miss, or simply a miss. The address given to the cache by the processor is actually subdivided into several pieces, each of which has a different role in accessing data.
  24. 24. [Figure: Direct Mapped Cache]
The set associative cache operates in a fashion somewhat similar to the direct-mapped cache. Bits from the line address are used to address a cache directory. However, now there are multiple choices: two, four, or more complete line addresses may be present in the directory. Each of these line addresses corresponds to a location in a sub-cache. The collection of these sub-caches forms the total cache array. In a set associative cache, as in the direct-mapped cache, all of these sub-arrays can be accessed simultaneously, together with the cache directory. If any of the entries in the cache directory matches the reference address, there is a hit, and the particular sub-cache array is selected and gated out to the processor.
[Figure: Set Associative Cache]
  25. 25. Cache Calculation
Address fields: Tag | Line/Set | Byte/Block
Given: cache = 512 bytes, 2 sets (ways), 16 bytes per line; main memory = 16 KB, 16 bytes per line.
Line size = 16 bytes = $2^4$, so Byte/Block = 4 bits
Main memory = 16 KB = $2^{14}$ bytes, so 14 address lines are needed to address main memory
Cache size = 512 bytes = $2^9$
Sets or ways = 2, so each way holds $2^9 / 2 = 2^8$ bytes
Lines per way = $2^8 / 2^4 = 2^4$, so Line/Set = 4 bits
Lines in main memory = $2^{14} / 2^4 = 2^{10}$
Tag size = (total number of lines in main memory) / (number of lines in one cache set) = $2^{10} / 2^4 = 2^6$, so Tag = 6 bits
Check: $2^{14}$ (Total) = $2^6$ (Tag) × $2^4$ (Line/Set) × $2^4$ (Block/Byte)
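The same calculation can be written as a short function, which makes it easy to repeat for other cache sizes. This is a sketch that assumes every size is a power of two; the function and parameter names are made up for illustration.

```python
def bits_for(count):
    """Number of address bits needed to select one of `count` items (power of two)."""
    if count < 1 or count & (count - 1):
        raise ValueError("count must be a power of two")
    return count.bit_length() - 1

def address_fields(memory_bytes, cache_bytes, line_bytes, ways):
    """Return (tag_bits, set_bits, offset_bits) for a set associative cache."""
    address_bits = bits_for(memory_bytes)
    offset_bits = bits_for(line_bytes)            # Byte/Block field
    lines_per_way = cache_bytes // (line_bytes * ways)
    set_bits = bits_for(lines_per_way)            # Line/Set field
    tag_bits = address_bits - set_bits - offset_bits
    return tag_bits, set_bits, offset_bits

# The worked example: 16 KB memory, 512-byte cache, 16-byte lines, 2 ways.
example = address_fields(16 * 1024, 512, 16, 2)

# The Pentium on-chip caches: 4 Gbyte physical space, 8 Kbytes, 32-byte lines, 2 ways.
pentium = address_fields(4 * 1024**3, 8 * 1024, 32, 2)
```

For the worked example the fields come out as 6, 4 and 4 bits, matching the check line; for the Pentium cache parameters they come out as 20, 7 and 5 bits.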
  26. 26. THE X87 FPU
FLOATING-POINT UNIT
The floating-point unit (FPU) of the Pentium processor is integrated with the integer unit on the first five stages of the u-pipe. The fifth stage, WB, becomes X1. The FPU is heavily pipelined. It is designed to be able to accept one floating-point operation every clock. It can receive up to two floating-point instructions every clock, one of which must be an exchange instruction.
Floating-Point Pipeline Stages
The Pentium processor FPU has 8 pipeline stages, the first five of which it shares with the integer unit. Integer instructions pass through only the first 5 stages. Integer instructions use the fifth (X1) stage as a WB (write-back) stage. The 8 FP pipeline stages, and the activities that are performed in them, are summarized below:
PF Prefetch;
D1 Instruction Decode;
D2 Address generation;
EX Memory and register read; conversion of FP data to external memory format and memory write;
X1 Floating-Point Execute stage one; conversion of external memory format to internal FP data format and write operand to FP register file; bypass 1 (bypass 1 described in the “Bypasses” section).
X2 Floating-Point Execute stage two;
WF Perform rounding and write floating-point result to register file; bypass 2 (bypass 2 described in the “Bypasses” section).
ER Error Reporting/Update Status Word.
FPU Bypasses
The Pentium processor stack architecture instruction set requires that all instructions have one source operand on the top of the stack. Since most instructions also have their destination as the top of the stack, most instructions see a “top of stack bottleneck.” New source operands must be brought to the top of the stack before an arithmetic instruction can be issued on them. This calls for extra usage of the exchange instruction, which allows the programmer to bring an available operand to the top of the stack.
The following section describes the floating-point register file bypasses that exist on the Pentium processor.
The register file has two write ports and two read ports. The read ports are used to read data out of the register file in the E stage. One write port is used to write data into the register file in the X1 stage, and the other in the WF stage. A bypass allows data that is about to be written into the register file to be available as an operand that is to be read from the register file by any succeeding floating-point instruction. A bypass is specified by a pair of ports (a write port and a read port) that get circumvented. Using the bypass, data is made available even before actually writing it to the register file.
  27. 27. The following procedures are implemented:
1. Bypass the X1 stage register file write port and the E stage register file read port.
2. Bypass the WF stage register file write port and the E stage register file read port.
With bypass 1, the result of a floating-point load (that writes to the register file in the X1 stage) can bypass the X1 stage write and be sent directly to the operand fetch stage or E stage of the next instruction. With bypass 2, the result of any arithmetic operation can bypass the WF stage write to the register file, and be sent directly to the desired execution unit as an operand for the next instruction.
PROGRAMMING WITH THE x87 FPU
The x87 Floating-Point Unit (FPU) provides high-performance floating-point processing capabilities for use in graphics processing, scientific, engineering, and business applications. It supports the floating-point, integer, and packed BCD integer data types and the floating-point processing algorithms and exception handling architecture defined in the IEEE Standard 754 for Binary Floating-Point Arithmetic.
X87 FPU EXECUTION ENVIRONMENT
The x87 FPU represents a separate execution environment within the IA-32 architecture. This execution environment consists of eight data registers (called the x87 FPU data registers) and the following special-purpose registers:
• Status register
• Control register
• Tag word register
• Last instruction pointer register
• Last data (operand) pointer register
• Opcode register
These registers are described in the following sections.
x87 FPU Data Registers
The x87 FPU data registers consist of eight 80-bit registers. Values are stored in these registers in the double extended-precision floating-point format. When floating-point, integer, or packed BCD integer values are loaded from memory into any of the x87 FPU data registers, the values are automatically converted into double extended-precision floating-point format (if they are not already in that format).
When computation results are subsequently transferred back into memory from any of the x87 FPU registers, the results can be left in the double extended-precision floating-point format or converted back into a shorter floating-point format, an integer format, or the packed BCD integer format.
  28. 28. [Figure: x87 FPU Execution Environment]
The x87 FPU instructions treat the eight x87 FPU data registers as a register stack. All addressing of the data registers is relative to the register on the top of the stack. The register number of the current top-of-stack register is stored in the TOP (stack TOP) field in the x87 FPU status word. Load operations decrement TOP by one and load a value into the new top-of-stack register, and store operations store the value from the current TOP register in memory and then increment TOP by one. (For the x87 FPU, a load operation is equivalent to a push and a store operation is equivalent to a pop.) Note that load and store operations are also available that do not push and pop the stack.
[Figure: x87 FPU Data Register Stack]
  29. 29. If a load operation is performed when TOP is at 0, register wraparound occurs and the new value of TOP is set to 7. The floating-point stack-overflow exception indicates when wraparound might cause an unsaved value to be overwritten.
Many floating-point instructions have several addressing modes that permit the programmer to implicitly operate on the top of the stack, or to explicitly operate on specific registers relative to the TOP. Assemblers support these register addressing modes, using the expression ST(0), or simply ST, to represent the current stack top and ST(i) to specify the ith register from TOP in the stack (0 ≤ i ≤ 7). For example, if TOP contains 011B (register 3 is the top of the stack), the following instruction would add the contents of two registers in the stack (registers 3 and 5):
FADD ST, ST(2);
The figure shows an example of how the stack structure of the x87 FPU registers and instructions are typically used to perform a series of computations. Here, a two-dimensional dot product is computed, as follows:
1. The first instruction (FLD value1) decrements the stack register pointer (TOP) and loads the value 5.6 from memory into ST(0). The result of this operation is shown in snapshot (a).
2. The second instruction multiplies the value in ST(0) by the value 2.4 from memory and stores the result in ST(0), shown in snapshot (b).
3. The third instruction decrements TOP and loads the value 3.8 into ST(0).
4. The fourth instruction multiplies the value in ST(0) by the value 10.3 from memory and stores the result in ST(0), shown in snapshot (c).
5. The fifth instruction adds the value in ST(0) and the value in ST(1) and stores the result in ST(0), shown in snapshot (d).
[Figure: Example x87 FPU Dot Product Computation]
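The five dot-product steps can be traced with a small model of the register stack. This is a sketch for following the TOP arithmetic only: the class and method names are hypothetical stand-ins for FLD, FMUL and FADD, Python floats replace the 80-bit format, and tags and exceptions are not modelled.

```python
class X87Stack:
    """Toy model of the eight x87 data registers addressed relative to TOP."""

    def __init__(self, top=0):
        self.regs = [0.0] * 8
        self.top = top

    def physical(self, i):
        """Physical register number of ST(i)."""
        return (self.top + i) % 8

    def st(self, i=0):
        """Read ST(i)."""
        return self.regs[self.physical(i)]

    def fld(self, value):
        """Load (push): decrement TOP with wraparound, then write ST(0)."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def fmul_mem(self, value):
        """Multiply ST(0) by a memory operand, result in ST(0)."""
        self.regs[self.top] *= value

    def fadd_st(self, i):
        """Add ST(i) to ST(0), result in ST(0)."""
        self.regs[self.top] += self.st(i)

fpu = X87Stack()
fpu.fld(5.6)        # step 1: ST(0) = 5.6, TOP wraps from 0 to 7
fpu.fmul_mem(2.4)   # step 2: ST(0) = 5.6 * 2.4
fpu.fld(3.8)        # step 3: ST(0) = 3.8, previous product is now ST(1)
fpu.fmul_mem(10.3)  # step 4: ST(0) = 3.8 * 10.3
fpu.fadd_st(1)      # step 5: ST(0) = ST(0) + ST(1)
dot_product = fpu.st(0)
```

The final ST(0) holds 5.6 × 2.4 + 3.8 × 10.3, approximately 52.58, and the first product remains available in ST(1).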
  30. 30. MICROPROCESSOR INITIALIZATION AND CONFIGURATION
Before normal operation of the Pentium processor can begin, the Pentium processor must be initialized by driving the RESET pin active. The RESET pin forces the Pentium processor to begin execution in a known state. Several features are optionally invoked at the falling edge of RESET: Built-in Self-Test (BIST), Functional Redundancy Checking and Tristate Test Mode.
In addition to the standard RESET pin, the Pentium processor has implemented an initialization pin (INIT) that allows the processor to begin execution in a known state without disrupting the contents of the internal caches or the floating-point state.
POWER UP SPECIFICATIONS
During power up, RESET must be asserted while VCC is approaching nominal operating voltage to prevent internal bus contention which could negatively affect the reliability of the processor. It is recommended that CLK begin toggling within 150 ms after VCC reaches its proper operating level. This recommendation is only to ensure long-term reliability of the device.
In order for RESET to be recognized, the CLK input needs to be toggling. RESET must remain asserted for 1 millisecond after VCC and CLK have reached their AC/DC specifications.
TEST AND CONFIGURATION FEATURES (BIST, FRC, TRISTATE TEST MODE)
The INIT, FLUSH#, and FRCMC# inputs are sampled when RESET transitions from high to low to determine if BIST will be run, or if tristate test mode or checker mode will be entered (respectively). If RESET is driven synchronously, these signals must be at their valid level and meet setup and hold times on the clock before the falling edge of RESET. If RESET is asserted asynchronously, these signals must be at their valid level two clocks before and after RESET transitions from high to low.
Built-In Self-Test
Self-test is initiated by driving the INIT pin high when RESET transitions from high to low. No bus cycles are run by the Pentium processor during self-test.
The duration of self-test is approximately 2^19 core clocks. Approximately 70% of the devices in the Pentium processor are tested by BIST. The Pentium processor BIST consists of two parts: hardware self-test and microcode self-test. During the hardware portion of BIST, the microcode ROM and all large PLAs are tested. All possible input combinations of the microcode ROM and PLAs are tested. The constant ROMs, BTB, TLBs, and all caches are tested by the microcode portion of BIST. The array tests (caches, TLBs and BTB) have two passes. On the first pass, data patterns are written to arrays, read back and checked for mismatches. The second pass writes the complement of the initial data pattern, reads it back, and checks for mismatches. The constant ROMs are tested by using the microcode to add various constants and check the result against a stored value.
  31. 31. Upon successful completion of BIST, the cumulative result of all tests is stored in the EAX register. If EAX contains 0h, then all checks passed; any non-zero result indicates a faulty unit.
Tristate Test Mode
When the FLUSH# pin is sampled low when RESET transitions from high to low, the Pentium processor enters tristate test mode. The Pentium processor floats all of its output pins and bidirectional pins, including pins which are never floated during normal operation (except TDO). Tristate test mode can be initiated in order to facilitate testing of board interconnects by external circuitry. The Pentium processor remains in tristate test mode until the RESET pin is asserted again.
Functional Redundancy Checking
The functional redundancy checking master/checker configuration input is sampled when RESET is high to determine whether the Pentium processor is configured in master mode (FRCMC# high) or checker mode (FRCMC# low). The final master/checker configuration of the Pentium processor is determined the clock before the falling edge of RESET. When configured as a master, the Pentium processor drives its output pins as required by the bus protocol. When configured as a checker, the Pentium processor tristates all outputs (except IERR#, PICD0, PICD1 and TDO) and samples the output pins (that would normally be driven in master mode). If the sampled value differs from the value computed internally, the Pentium processor asserts IERR# to indicate an error.
INITIALIZATION WITH RESET, INIT AND BIST
Two pins, RESET and INIT, are used to reset the Pentium processor in different manners. A “cold” or “power on” RESET refers to the assertion of RESET while power is initially being applied to the Pentium processor.
A “warm” RESET refers to the assertion of RESET or INIT while VCC and CLK remain within specified operating limits. Table 3-1 shows the effect of asserting RESET and/or INIT.
Toggling either the RESET pin or the INIT pin individually forces the Pentium processor to begin execution at address FFFFFFF0h. The internal instruction cache and data cache are invalidated when RESET is asserted (modified lines in the data cache are NOT written back). The instruction cache and data cache are not altered when the INIT pin is asserted without RESET. In both cases, the branch target buffer (BTB) and translation lookaside buffers (TLBs) are invalidated. After RESET (with or without BIST) or INIT, the Pentium processor will start executing instructions at location FFFFFFF0h. When the first intersegment jump or call instruction is executed, address lines A20-A31 will be driven low for CS-relative memory cycles and the Pentium processor will only execute
  32. 32. instructions in the lower one Mbyte of physical memory. This allows the system designer to use a ROM at the top of physical memory to initialize the system. RESET is internally hardwired and forces the Pentium processor to terminate all execution and bus cycle activity within 2 clocks. No instruction or bus activity will occur as long as RESET is active. INIT is implemented as an edge-triggered interrupt and will be recognized when an instruction boundary is reached. As soon as the Pentium processor completes the INIT sequence, instruction execution and bus cycle activity will continue at address FFFFFFF0h even if the INIT pin is not deasserted. At the conclusion of RESET (with or without self-test) or INIT, the DX register will contain a component identifier. The upper byte will contain 05h and the lower byte will contain a stepping identifier.
  33. 33. BUS CYCLES
The Pentium processor bus is designed to support a 528-Mbyte/sec data transfer rate at 66 MHz. All data transfers occur as a result of one or more bus cycles.
PHYSICAL MEMORY AND I/O INTERFACE
Pentium processor memory is accessible in 8-, 16-, 32-, and 64-bit quantities. Pentium processor I/O is accessible in 8-, 16-, and 32-bit quantities. The Pentium processor can directly address up to 4 Gbytes of physical memory, and up to 64 Kbytes of I/O.
In hardware, memory space is organized as a sequence of 64-bit quantities. Each 64-bit location has eight individually addressable bytes at consecutive memory addresses.
[Figure: Memory Organization]
The I/O space is organized as a sequence of 32-bit quantities. Each 32-bit quantity has four individually addressable bytes at consecutive memory addresses. See the figure for a conceptual diagram of the I/O space.
[Figure: I/O Space Organization]
  34. 34. Sixty-four-bit memories are organized as arrays of physical quadwords (8-byte words). Physical quadwords begin at addresses evenly divisible by 8. The quadwords are addressable by physical address lines A31-A3.
Thirty-two-bit memories are organized as arrays of physical dwords (4-byte words). Physical dwords begin at addresses evenly divisible by 4. The dwords are addressable by physical address lines A31-A3 and A2. A2 can be decoded from the byte enables.
Sixteen-bit memories are organized as arrays of physical words (2-byte words). Physical words begin at addresses evenly divisible by 2.
DATA TRANSFER MECHANISM
All data transfers occur as a result of one or more bus cycles. Logical data operands of byte, word, dword, and quadword lengths may be transferred. Data may be accessed at any byte boundary, but two cycles may be required for misaligned data transfers. The Pentium processor considers a 2-byte or 4-byte operand that crosses a 4-byte boundary to be misaligned. In addition, an 8-byte operand that crosses an 8-byte boundary is misaligned. The Pentium processor address signals are split into two components. High-order address bits are provided by the address lines A31-A3. The byte enables BE7#-BE0# form the low-order address and select the appropriate byte of the 8-byte data bus.
For both memory and I/O accesses, the byte enable outputs indicate which of the associated data bus bytes are driven valid for write cycles and on which bytes data is expected back for read cycles. Non-contiguous byte enable patterns will never occur.
[Figure: Generating A2-A0 from BE7-0#]
Interfacing With 8-, 16-, 32-, and 64-Bit Memories
In 64-bit physical memories, each 8-byte quadword begins at a byte address that is a multiple of eight. A31-A3 are used as an 8-byte quadword select and BE7#-BE0# select individual bytes within the quadword.
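The relationship between an operand's address, the byte enables, and the misalignment rule can be sketched as below. This is an illustrative model under assumptions: byte enables are represented as a mask in which bit n set means BEn# is asserted (driven low), A2-A0 is taken to be the index of the lowest asserted byte enable (consistent with contiguous patterns, but the figure's exact logic is not reproduced here), and the function names are hypothetical.

```python
def byte_enables(address, size):
    """Mask of asserted byte enables for an operand inside one quadword.

    Bit n set means BEn# is asserted. Operands that spill past the
    quadword need a second bus cycle and are rejected here.
    """
    first_lane = address & 7
    if first_lane + size > 8:
        raise ValueError("operand spans two quadwords")
    return ((1 << size) - 1) << first_lane

def low_address_bits(mask):
    """Recover A2-A0 as the index of the lowest asserted byte enable."""
    if mask == 0:
        raise ValueError("no byte enable asserted")
    return (mask & -mask).bit_length() - 1

def is_misaligned(address, size):
    """Apply the misalignment rule stated above for 2-, 4- and 8-byte operands."""
    if size in (2, 4):
        return (address % 4) + size > 4   # crosses a 4-byte boundary
    if size == 8:
        return address % 8 != 0           # crosses an 8-byte boundary
    return False                          # single bytes are always aligned

dword_mask = byte_enables(0x1004, 4)      # upper four byte lanes
quad_mask = byte_enables(0x1000, 8)       # all eight byte lanes
```

For the dword at 0x1004 the mask is 0xF0, and the lowest asserted lane gives A2-A0 = 4, which is how external logic can decode A2 for a 32-bit memory.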
  35. 35. Figure: Pentium® Processor with 64-Bit Memory
The figure shows the Pentium processor data bus interface to 32-, 16- and 8-bit wide memories. External byte swapping logic is needed on the data lines so that data is supplied to and received from the Pentium processor on the correct data pins (see the table). For memory widths smaller than 64 bits, byte assembly logic is needed to return all bytes of data requested by the Pentium processor in one cycle.
Addressing 32-, 16- and 8-Bit Memories
  36. 36. Data Bus Interface to 32-, 16- and 8-Bit Memories
Operand alignment and size dictate when two cycles are required for a data transfer.
  37. 37. BUS STATE DEFINITION
This section describes the Pentium processor bus states in detail. See the figure for the bus state diagram.
Ti: This is the bus idle state. In this state, no bus cycles are being run. The Pentium processor may or may not be driving the address and status pins, depending on the state of the HLDA, AHOLD, and BOFF# inputs. An asserted BOFF# or RESET will always force the state machine back to this state. HLDA will only be driven in this state.
T1: This is the first clock of a bus cycle. Valid address and status are driven out and ADS# is asserted. There is one outstanding bus cycle.
T2: This is the second and subsequent clock of the first outstanding bus cycle. In state T2, data is driven out (if the cycle is a write), or data is expected (if the cycle is a read), and the BRDY# pin is sampled. There is one outstanding bus cycle.
T12: This state indicates there are two outstanding bus cycles, and that the Pentium processor is starting the second bus cycle at the same time that data is being transferred for the first. In T12, the Pentium processor drives the address and status and asserts ADS# for the second outstanding bus cycle, while data is transferred and BRDY# is sampled for the first outstanding cycle.
T2P: This state indicates there are two outstanding bus cycles, and that both are in their second and subsequent clocks. In T2P, data is being transferred and BRDY# is sampled for the first outstanding cycle. The address, status and ADS# for the second outstanding cycle were driven sometime in the past (in state T12).
TD: This state indicates there is one outstanding bus cycle, that its address, status and ADS# have already been driven sometime in the past (in state T12), and that the data and BRDY# pins are not being sampled because the data bus requires one dead clock to turn around between consecutive reads and writes, or writes and reads. The Pentium processor enters TD if in the previous clock there were two outstanding cycles, the last BRDY# was returned, and a dead clock is needed. The timing diagrams in the next section give examples of when a dead clock is needed.
The table gives a brief summary of bus activity during each bus state. The figure shows the Pentium processor bus state diagram.
Table: Pentium® Processor Bus Activity
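The non-pipelined part of these state definitions can be sketched as a tiny state machine. This is a deliberately reduced model: it covers only Ti, T1 and T2, and it assumes that no second cycle is ever started early (so T12, T2P and TD are never entered) and that HLDA, AHOLD, BOFF# and RESET stay inactive. The function names are illustrative.

```python
def next_state(state: str, request_pending: bool, last_brdy: bool) -> str:
    """Next bus state for the non-pipelined subset (Ti, T1, T2 only)."""
    if state == "Ti":
        return "T1" if request_pending else "Ti"   # a new cycle asserts ADS#
    if state == "T1":
        return "T2"                                # T1 always lasts one clock
    if state == "T2":
        if not last_brdy:
            return "T2"                            # wait state or more data to come
        return "T1" if request_pending else "Ti"   # cycle complete
    raise ValueError(f"state {state!r} is outside this simplified model")


def trace_single_transfer(wait_states: int) -> list[str]:
    """States visited by one single-transfer cycle that starts from idle."""
    trace = ["Ti"]
    state = next_state("Ti", request_pending=True, last_brdy=False)
    trace.append(state)                            # T1
    state = next_state(state, request_pending=False, last_brdy=False)
    trace.append(state)                            # first T2
    for _ in range(wait_states):                   # BRDY# withheld
        state = next_state(state, request_pending=False, last_brdy=False)
        trace.append(state)
    state = next_state(state, request_pending=False, last_brdy=True)
    trace.append(state)                            # back to Ti
    return trace
```

With zero wait states the trace is Ti, T1, T2, Ti; each wait state adds one more T2 clock before the bus returns to idle.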
  38. 38. Figure: Pentium® Processor Bus Control State Machine
  39. 39. BUS CYCLES
The Pentium processor requests data transfer cycles, bus cycles, and bus operations. A data transfer cycle is one data item, up to 8 bytes in width, being returned to the Pentium processor or accepted from the Pentium processor with BRDY# asserted. A bus cycle begins with the Pentium processor driving an address and status and asserting ADS#, and ends when the last BRDY# is returned. A bus cycle may have 1 or 4 data transfers. A burst cycle is a bus cycle with 4 data transfers. A bus operation is a sequence of bus cycles to carry out a specific function, such as a locked read-modify-write or an interrupt acknowledge.
Single-Transfer Cycle
The Pentium processor supports a number of different types of bus cycles. The simplest type of bus cycle is a single-transfer non-cacheable 64-bit cycle, either with or without wait states. Non-pipelined read and write cycles with 0 wait states are shown in the figure.
Figure: Non Pipelined Read or Write
  40. 40. The Pentium processor initiates a cycle by asserting the address status signal (ADS#) in the first clock. The clock in which ADS# is asserted is by definition the first clock in the bus cycle. The ADS# output indicates that a valid bus cycle definition and address is available on the cycle definition pins and the address bus. The CACHE# output is deasserted (high) to indicate that the cycle will be a single transfer cycle.
For a zero wait state transfer, BRDY# is returned by the external system in the second clock of the bus cycle. BRDY# indicates that the external system has presented valid data on the data pins in response to a read or the external system has accepted data in response to a write. The Pentium processor samples the BRDY# input in the second and subsequent clocks of a bus cycle.
If the system is not ready to drive or accept data, wait states can be added to these cycles by not returning BRDY# to the processor at the end of the second clock. Cycles of this type, with one and two wait states added, are shown in the figure. Note that BRDY# must be driven inactive at the end of the second clock.
Burst Cycles
For bus cycles that require more than a single data transfer (cacheable cycles and writeback cycles), the Pentium processor uses the burst data transfer. In burst transfers, a new data item can be sampled or driven by the Pentium processor in consecutive clocks. In addition, the addresses of the data items in burst cycles all fall within the same 32-byte aligned area (corresponding to an internal Pentium processor cache line).
The implementation of burst cycles is via the BRDY# pin. While running a bus cycle of more than one data transfer, the Pentium processor requires that the memory system perform a burst transfer and follow the burst order (see the table). Given the first address in the burst sequence, the address of subsequent transfers must be calculated by external hardware. This requirement exists because the Pentium processor address and byte enables are asserted for the first transfer and are not re-driven for each transfer. The burst sequence is optimized for two-bank memory subsystems and is shown in the table.
Table: Pentium Processor Burst Order
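The burst order table itself did not survive in this transcript. As an illustration of what the external hardware has to compute, the sketch below generates the four quadword addresses of a burst using the sequence Intel documents for the Pentium processor (first address 0 gives 0, 8, 10h, 18h; first address 8 gives 8, 0, 18h, 10h; and so on), expressed here as an XOR of the starting offset with 0, 8, 10h and 18h. The function name is hypothetical.

```python
LINE_MASK = 0x1F          # a burst stays inside one 32-byte aligned line
QUADWORD_OFFSETS = 0x18   # address bits 4 and 3 pick the quadword in the line


def burst_order(first_addr: int) -> list[int]:
    """Addresses of the four 8-byte transfers of one burst cycle."""
    line_base = first_addr & ~LINE_MASK
    first_offset = first_addr & QUADWORD_OFFSETS
    return [line_base | (first_offset ^ step) for step in (0x00, 0x08, 0x10, 0x18)]


# A linefill whose first requested quadword sits at offset 8 of the line.
sequence = burst_order(0x1008)
```

Here `sequence` is 0x1008, 0x1000, 0x1018, 0x1010: the requested quadword comes first and the remaining three follow within the same 32-byte line.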
  41. 41. BURST READ CYCLES
When initiating any read, the Pentium processor will present the address and byte enables for the data item requested. When the cycle is converted into a cache linefill, the first data item returned should correspond to the address sent out by the Pentium processor; however, the byte enables should be ignored, and valid data must be returned on all 64 data lines. In addition, the address of the subsequent transfers in the burst sequence must be calculated by external hardware since the address and byte enables are not re-driven for each transfer.
The figure shows a cacheable burst read cycle. Note that in this case the initial cycle generated by the Pentium processor might have been satisfied by a single data transfer, but was transformed into a multiple-transfer cache fill by KEN# being returned active on the clock that the first BRDY# is returned. In this case KEN# has such an effect because the cycle is internally cacheable in the Pentium processor (CACHE# pin is driven active). KEN# is only sampled once during a cycle to determine cacheability.
Figure: Basic Burst Read Cycle
  42. 42. BURST WRITE CYCLES
The figure shows the timing diagram of a basic burst write cycle. KEN# is ignored in a burst write cycle. If the CACHE# pin is active (low) during a write cycle, it indicates that the cycle will be a burst writeback cycle. Burst write cycles are always writebacks of modified lines in the data cache. Writeback cycles have several causes:
1. Writeback due to replacement of a modified line in the data cache.
2. Writeback due to an inquire cycle that hits a modified line in the data cache.
3. Writeback due to an internal snoop that hits a modified line in the data cache.
4. Writebacks caused by asserting the FLUSH# pin.
5. Writebacks caused by executing the WBINVD instruction.
The only write cycles that are burstable by the Pentium processor are writeback cycles. All other write cycles will be 64 bits or less, single transfer bus cycles.
Figure: Basic Burst Write Cycle
For writeback cycles, the lower five bits of the first burst address are always zero; therefore, the burst order becomes 0, 8h, 10h, and 18h. Again, note that the address of the subsequent transfers in the burst sequence must be calculated by external hardware since the Pentium processor does not drive the address and byte enables for each transfer.
  43. 43. Locked Operations
The Pentium processor architecture provides a facility to perform atomic accesses of memory. For example, a programmer can change the contents of a memory-based variable and be assured that the variable was not accessed by another bus master between the read of the variable and the update of that variable. This functionality is provided for select instructions using a LOCK prefix, and also for instructions which implicitly perform locked read-modify-write cycles, such as the XCHG (exchange) instruction when one of its operands is memory based. Locked cycles are also generated when a segment descriptor or page table entry is updated and during interrupt acknowledge cycles.
In hardware, the LOCK functionality is implemented through the LOCK# pin, which indicates to the outside world that the Pentium processor is performing a read-modify-write sequence of cycles, and that the Pentium processor should be allowed atomic access for the location that was accessed with the first locked cycle. Locked operations begin with a read cycle and end with a write cycle. Note that the data width read is not necessarily the data width written. For example, for descriptor access bit updates the Pentium processor fetches eight bytes and writes one byte.
A locked operation is a combination of one or multiple read cycles followed by one or multiple write cycles. Programmer-generated locked cycles and locked page table / directory accesses are treated differently and are described in the following sections.
Snooping (Inquire)
When operating in an MP system, IA-32 processors (beginning with the Intel486 processor) have the ability to snoop other processors' accesses to system memory and to their internal caches. They use this snooping ability to keep their internal caches consistent both with system memory and with the caches in other processors on the bus. For example, in the Pentium and P6 family processors, if through snooping one processor detects that another processor intends to write to a memory location that it currently has cached in shared state, the snooping processor will invalidate its cache line, forcing it to perform a cache line fill the next time it accesses the same memory location.
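The invalidate-on-snoop behaviour in the example above can be illustrated with a toy cache model. It is a heavily simplified sketch: it tracks only whether a 32-byte line is present, treats every cached line as if it were in shared state, and ignores modified lines (which would need a writeback) as well as all other protocol states. The class and method names are hypothetical.

```python
LINE_SIZE = 32  # bytes per cache line, as for the Pentium internal caches


class SnoopingCache:
    """Toy cache that only records which lines it currently holds."""

    def __init__(self) -> None:
        self.valid_lines: set[int] = set()

    @staticmethod
    def line_of(addr: int) -> int:
        return addr & ~(LINE_SIZE - 1)

    def read(self, addr: int) -> bool:
        """Return True on a hit; a miss performs a line fill."""
        line = self.line_of(addr)
        hit = line in self.valid_lines
        if not hit:
            self.valid_lines.add(line)      # cache line fill
        return hit

    def snoop_write(self, addr: int) -> bool:
        """Another processor writes addr; invalidate our copy if we hold one."""
        line = self.line_of(addr)
        if line in self.valid_lines:
            self.valid_lines.remove(line)
            return True
        return False
```

A read of 0x2004 misses and fills the line, a later read of 0x2010 hits the same line, and after a snooped write to 0x2008 the next read of that line misses again and triggers a new fill.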
  44. 44. REGISTER SET
Figure: Alternate General Purpose Register Names
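The alternate register names are views onto parts of the same 32-bit register. The sketch below models EAX as a plain integer and shows how AX, AH and AL map onto its bits; the same layout applies to EBX, ECX and EDX. The helper names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def sub_registers(eax: int) -> dict[str, int]:
    """Values seen through the AX, AH and AL names for a given EAX."""
    eax &= MASK32
    return {
        "AX": eax & 0xFFFF,        # low 16 bits
        "AH": (eax >> 8) & 0xFF,   # bits 15..8
        "AL": eax & 0xFF,          # bits 7..0
    }


def write_al(eax: int, value: int) -> int:
    """Writing AL replaces only the low byte; the upper 24 bits are kept."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


views = sub_registers(0x12345678)
```

For EAX = 0x12345678 the views are AX = 0x5678, AH = 0x56 and AL = 0x78, and writing 0xFF to AL yields EAX = 0x123456FF.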
  45. 45. • I/O ports — The IA-32 architecture supports transfers of data to and from input/output (I/O) ports.
• Control registers — The five control registers (CR0 through CR4) determine the operating mode of the processor and the characteristics of the currently executing task.
• Memory management registers — The GDTR, IDTR, task register, and LDTR specify the locations of data structures used in protected mode memory management.
• Debug registers — The debug registers (DR0 through DR7) control and allow monitoring of the processor's debugging operations.
BASIC PROGRAM EXECUTION REGISTERS
The processor provides 16 basic program execution registers for use in general system and application programming (see the figure). These registers can be grouped as follows:
• General-purpose registers. These eight registers are available for storing operands and pointers.
• Segment registers. These registers hold up to six segment selectors.
• EFLAGS (program status and control) register. The EFLAGS register reports on the status of the program being executed and allows limited (application-program level) control of the processor.
• EIP (instruction pointer) register. The EIP register contains a 32-bit pointer to the next instruction to be executed.
The general-purpose registers have the following special uses:
• EAX — Accumulator for operands and results data
• EBX — Pointer to data in the DS segment
• ECX — Counter for string and loop operations
• EDX — I/O pointer
• ESI — Pointer to data in the segment pointed to by the DS register; source pointer for string operations
• EDI — Pointer to data (or destination) in the segment pointed to by the ES register; destination pointer for string operations
• ESP — Stack pointer (in the SS segment)
• EBP — Pointer to data on the stack (in the SS segment)
As shown in Figure 3-5, the lower 16 bits of the general-purpose registers map directly to the register set found in the 8086 and Intel 286 processors and can be referenced with the names AX, BX, CX, DX, BP, SI, DI, and SP. Each of the lower two bytes of the EAX, EBX, ECX, and EDX registers can be referenced by the names AH, BH, CH, and DH (high bytes) and AL, BL, CL, and DL (low bytes).
DATA TYPES
This chapter introduces data types defined for the IA-32 architecture.
FUNDAMENTAL DATA TYPES
The fundamental data types of the IA-32 architecture are bytes, words, doublewords, quadwords, and double quadwords (see the figure). A byte is eight bits, a word is 2 bytes
  46. 46. (16 bits), a doubleword is 4 bytes (32 bits), a quadword is 8 bytes (64 bits), and a double quadword is 16 bytes (128 bits). A subset of the IA-32 architecture instructions operates on these fundamental data types without any additional operand typing.
The figure shows the byte order of each of the fundamental data types when referenced as operands in memory. The low byte (bits 0 through 7) of each data type occupies the lowest address in memory and that address is also the address of the operand.
Figure: Bytes, Words, Doublewords, Quadwords, and Double Quadwords in Memory
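The byte order described here (low byte at the lowest address) is little-endian order. A minimal sketch using Python's built-in integer conversions shows how a doubleword would be laid out in memory; the function names are illustrative.

```python
def to_memory_bytes(value: int, size: int) -> bytes:
    """Bytes of an unsigned operand in ascending address order."""
    return value.to_bytes(size, "little")


def from_memory_bytes(data: bytes) -> int:
    """Rebuild the operand from bytes stored at ascending addresses."""
    return int.from_bytes(data, "little")


layout = to_memory_bytes(0x12345678, 4)   # a doubleword
```

For the doubleword 0x12345678 the byte at the operand's address is 0x78, followed by 0x56, 0x34 and 0x12 at the next three addresses.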
  47. 47. Alignment of Words, Doublewords, Quadwords, and Double Quadwords
Words, doublewords, and quadwords do not need to be aligned in memory on natural boundaries. The natural boundaries for words, doublewords, and quadwords are even-numbered addresses, addresses evenly divisible by four, and addresses evenly divisible by eight, respectively. However, to improve the performance of programs, data structures (especially stacks) should be aligned on natural boundaries whenever possible. The reason for this is that the processor requires two memory accesses to make an unaligned memory access; aligned accesses require only one memory access. A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access.
Some instructions that operate on double quadwords require memory operands to be aligned on a natural boundary. These instructions generate a general-protection exception (#GP) if an unaligned operand is specified. A natural boundary for a double quadword is any address evenly divisible by 16. Other instructions that operate on double quadwords permit unaligned access (without generating a general-protection exception). However, additional memory bus cycles are required to access unaligned data from memory.
NUMERIC DATA TYPES
Although bytes, words, and doublewords are the fundamental data types of the IA-32 architecture, some instructions support additional interpretations of these data types to allow operations to be performed on numeric data types (signed and unsigned integers, and floating-point numbers). See the figure.
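The natural boundaries listed above can be captured in a small lookup, together with the usual round-up trick for placing a data structure on such a boundary. This is a sketch with hypothetical helper names; it says nothing about how any particular assembler or compiler performs alignment.

```python
NATURAL_BOUNDARY = {
    "word": 2,
    "doubleword": 4,
    "quadword": 8,
    "double quadword": 16,
}


def is_naturally_aligned(addr: int, data_type: str) -> bool:
    """True if addr is evenly divisible by the type's natural boundary."""
    return addr % NATURAL_BOUNDARY[data_type] == 0


def align_up(addr: int, boundary: int) -> int:
    """Smallest address >= addr that is a multiple of a power-of-two boundary."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a power of two")
    return (addr + boundary - 1) & ~(boundary - 1)


next_dqword_slot = align_up(0x1003, NATURAL_BOUNDARY["double quadword"])
```

Address 0x1006 is naturally aligned for a word but not for a doubleword, and the first double-quadword boundary at or after 0x1003 is 0x1010.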
  48. 48. Figure: Numeric Data Types
OPERAND ADDRESSING
IA-32 machine instructions act on zero or more operands. Some operands are specified explicitly and others are implicit. The data for a source operand can be located in:
• the instruction itself (an immediate operand)
• a register
• a memory location
• an I/O port
When an instruction returns data to a destination operand, it can be returned to:
• a register
• a memory location
• an I/O port
Immediate Operands
Some instructions use data encoded in the instruction itself as a source operand. These operands are called immediate operands (or simply immediates). For example, the following ADD instruction adds an immediate value of 14 to the contents of the EAX register:
ADD EAX, 14
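To make the effect of the ADD instruction above concrete, the following sketch models a 32-bit register add with an immediate operand. It is an illustration only: EAX is represented by a plain integer, only the carry flag is modeled, and the function name is made up.

```python
MASK32 = 0xFFFFFFFF


def add_eax_imm(eax: int, imm: int) -> tuple[int, bool]:
    """Return (new EAX, carry flag) for a 32-bit add of an immediate."""
    if not -(1 << 31) <= imm <= MASK32:
        raise ValueError("immediate does not fit in a doubleword")
    total = (eax & MASK32) + (imm & MASK32)   # negative immediates wrap to two's complement
    return total & MASK32, total > MASK32


result, carry = add_eax_imm(10, 14)           # models ADD EAX, 14 with EAX = 10
```

With EAX = 10 the result is 24 with no carry; with EAX = 0xFFFFFFFF the sum wraps around to 13 and the carry flag is set.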
  49. 49. All arithmetic instructions (except the DIV and IDIV instructions) allow the source operand to be an immediate value. The maximum value allowed for an immediate operand varies among instructions, but can never be greater than the maximum value of an unsigned doubleword integer (2^32 - 1).
Register Operands
Source and destination operands can be any of the following registers, depending on the instruction being executed:
• 32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, or EBP)
• 16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, or BP)
• 8-bit general-purpose registers (AH, BH, CH, DH, AL, BL, CL, or DL)
• segment registers (CS, DS, SS, ES, FS, and GS)
• EFLAGS register
• x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand pointer, and instruction pointer)
Some instructions (such as the DIV and MUL instructions) use quadword operands contained in a pair of 32-bit registers. Register pairs are represented with a colon separating them. For example, in the register pair EDX:EAX, EDX contains the high order bits and EAX contains the low order bits of a quadword operand.
Several instructions (such as the PUSHFD and POPFD instructions) are provided to load and store the contents of the EFLAGS register or to set or clear individual flags in this register. Other instructions (such as the Jcc instructions) use the state of the status flags in the EFLAGS register as condition codes for branching or other decision making operations.
The processor contains a selection of system registers that are used to control memory management, interrupt and exception handling, task management, processor management, and debugging activities. Some of these system registers are accessible by an application program, the operating system, or the executive through a set of system instructions. When accessing a system register with a system instruction, the register is generally an implied operand of the instruction.
Memory Operands
Source and destination operands in memory are referenced by means of a segment selector and an offset (see the figure). Segment selectors specify the segment containing the operand. Offsets specify the linear or effective address of the operand. Offsets can be 32 bits (represented by the notation m16:32) or 16 bits (represented by the notation m16:16).
Figure: Memory Operand Address
Specifying a Segment Selector
The segment selector can be specified either implicitly or explicitly. The most common method of specifying a segment selector is to load it in a segment register and then allow
  50. 50. the processor to select the register implicitly, depending on the type of operation being performed. The processor automatically chooses a segment according to the rules given in the table.
When storing data in memory or loading data from memory, the DS segment default can be overridden to allow other segments to be accessed. Within an assembler, the segment override is generally handled with a colon ":" operator. For example, the following MOV instruction moves a value from register EAX into the segment pointed to by the ES register. The offset into the segment is contained in the EBX register:
MOV ES:[EBX], EAX
Table: Default Segment Selection Rules
At the machine level, a segment override is specified with a segment-override prefix, which is a byte placed at the beginning of an instruction. The following default segment selections cannot be overridden:
• Instruction fetches must be made from the code segment.
• Destination strings in string instructions must be stored in the data segment pointed to by the ES register.
• Push and pop operations must always reference the SS segment.
Some instructions require a segment selector to be specified explicitly. In these cases, the 16-bit segment selector can be located in a memory location or in a 16-bit register. For example, the following MOV instruction moves a segment selector located in register BX into segment register DS:
MOV DS, BX
Segment selectors can also be specified explicitly as part of a 48-bit far pointer in memory. Here, the first doubleword in memory contains the offset and the next word contains the segment selector.
Specifying an Offset
The offset part of a memory address can be specified directly as a static value (called a displacement) or through an address computation made up of one or more of the following components:
• Displacement — An 8-, 16-, or 32-bit value.
• Base — The value in a general-purpose register.
• Index — The value in a general-purpose register.
• Scale factor — A value of 2, 4, or 8 that is multiplied by the index value.
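The components listed above combine into an offset as base + (index × scale) + displacement. The sketch below computes that sum for 32-bit addressing, wrapping the result to 32 bits; it is an illustrative model with a hypothetical function name, and it additionally accepts a scale of 1 to represent an unscaled index.

```python
MASK32 = 0xFFFFFFFF


def effective_address(base: int = 0, index: int = 0,
                      scale: int = 1, displacement: int = 0) -> int:
    """Offset = base + index * scale + displacement, modulo 2**32."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale factor must be 1, 2, 4 or 8")
    return (base + index * scale + displacement) & MASK32


# Fourth doubleword (index 3, 4-byte elements) of an array whose data
# starts 8 bytes into a structure located at 0x1000.
element = effective_address(base=0x1000, index=3, scale=4, displacement=8)

# A negative displacement relative to a base register.
local_var = effective_address(base=0x2000, displacement=-4)
```

The first call evaluates to 0x1014 and the second to 0x1FFC, which is how a negative (two's complement) displacement reaches below the base address.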
  51. 51. The offset which results from adding these components is called an effective address. Each of these components can have either a positive or negative (2s complement) value, with the exception of the scaling factor. Figure 3-11 shows all the possible ways that these components can be combined to create an effective address in the selected segment.
Figure: Offset (or Effective Address) Computation
The uses of general-purpose registers as base or index components are restricted in the following manner:
• The ESP register cannot be used as an index register.
• When the ESP or EBP register is used as the base, the SS segment is the default segment. In all other cases, the DS segment is the default segment.
The base, index, and displacement components can be used in any combination, and any of these components can be null. A scale factor may be used only when an index also is used. Each possible combination is useful for data structures commonly used by programmers in high-level languages and assembly language. The following addressing modes suggest uses for common combinations of address components.
• Displacement: A displacement alone represents a direct (uncomputed) offset to the operand. Because the displacement is encoded in the instruction, this form of an address is sometimes called an absolute or static address. It is commonly used to access a statically allocated scalar operand.
• Base: A base alone represents an indirect offset to the operand. Since the value in the base register can change, it can be used for dynamic storage of variables and data structures.
• Base + Displacement: A base register and a displacement can be used together for two distinct purposes:
  • As an index into an array when the element size is not 2, 4, or 8 bytes—The displacement component encodes the static offset to the beginning of the array. The base register holds the results of a calculation to determine the offset to a specific element within the array.
  • To access a field of a record: the base register holds the address of the beginning of the record, while the displacement is a static offset to the field.
An important special case of this combination is access to parameters in a procedure activation record. A procedure activation record is the stack frame created when a procedure is entered. Here, the EBP register is the best choice for the base register,