Assembly Language: x86 Family Architecture. Motaz K. Saad, Dept. of CS, Spring 2007.
Overview: General Concepts; IA-32 Processor Architecture; IA-32 Memory Management; Components of an IA-32 Microcomputer; Input-Output System.
General Concepts: Basic microcomputer design; Instruction execution cycle; Reading from memory; How programs run.
Basic Microcomputer Design: The clock synchronizes CPU operations. The control unit (CU) coordinates the sequence of execution steps. The ALU performs arithmetic and bitwise processing.
[Figure: microcomputer block diagram. Labels: Processor (Control Unit, Arithmetic Logic Unit (ALU)), Memory, Input Devices, Storage Devices, Output Devices; flows labeled Data, Information, and Instructions.]
Clock: Synchronizes all CPU and bus operations. A machine (clock) cycle measures the time of a single operation. The clock is used to trigger events.
Instruction Execution Cycle: Fetch; Decode; Fetch operands; Execute; Store output.
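The five steps of the cycle can be sketched as a loop over a toy machine. This is only an illustration: the instruction format, the opcode names, and the `run` function are hypothetical and do not correspond to any real x86 encoding.

```python
def run(program, registers):
    """Execute a list of (opcode, dest, src1, src2) tuples on a toy machine."""
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        instruction = program[pc]          # 1. fetch
        pc += 1
        opcode, dest, a, b = instruction   # 2. decode
        x, y = registers[a], registers[b]  # 3. fetch operands
        if opcode == "ADD":                # 4. execute
            result = x + y
        elif opcode == "SUB":
            result = x - y
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        registers[dest] = result           # 5. store output
    return registers


regs = run(
    [("ADD", "r2", "r0", "r1"), ("SUB", "r3", "r2", "r0")],
    {"r0": 5, "r1": 7, "r2": 0, "r3": 0},
)
```

Here `r2` receives 5 + 7 and `r3` receives `r2` minus `r0`.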
Cache Memory: High-speed, expensive static RAM both inside and outside the CPU. Level-1 cache: inside the CPU. Level-2 cache: outside the CPU. Cache hit: the data to be read is already in cache memory. Cache miss: the data to be read is not in cache memory.
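A small simulation can make the hit/miss distinction concrete. The sketch below assumes a direct-mapped cache with one address per line, which is a simplification chosen for illustration; the slide does not say how the cache is organized, and `simulate_cache` is a hypothetical name.

```python
def simulate_cache(addresses, num_lines=4):
    """Count hits and misses for a direct-mapped cache of single-address lines."""
    lines = [None] * num_lines      # each line remembers the address it holds
    hits = misses = 0
    for addr in addresses:
        index = addr % num_lines    # the line this address maps to
        if lines[index] == addr:
            hits += 1               # cache hit: data already in the cache
        else:
            misses += 1             # cache miss: load it, evicting the old entry
            lines[index] = addr
    return hits, misses


hits, misses = simulate_cache([0, 1, 2, 0, 1, 4, 0])
```

Addresses 0 and 4 map to the same line, so reading 4 evicts 0 and the final read of 0 misses again.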
How a Program Runs
Multitasking: The OS can run multiple programs at the same time, and a single program can have multiple threads of execution. A scheduler utility assigns a given amount of CPU time to each running program. Rapid switching between tasks gives the illusion that all programs are running at once. The processor must support task switching.
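One simple way a scheduler can hand out CPU time is round-robin time slicing. The sketch below is an assumption made for illustration; the slide does not name a scheduling policy, and the task names and `round_robin` function are hypothetical.

```python
from collections import deque


def round_robin(tasks, time_slice):
    """tasks maps a name to its remaining time units; returns [(name, units_run), ...]."""
    queue = deque(tasks.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(time_slice, remaining)     # run for one slice at most
        timeline.append((name, ran))
        if remaining > ran:                  # unfinished: go to the back of the queue
            queue.append((name, remaining - ran))
    return timeline


timeline = round_robin({"editor": 3, "compiler": 5}, time_slice=2)
```

The resulting timeline alternates between the two tasks until the shorter one finishes, which is the rapid switching the slide describes.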
IA-32 Processor Architecture: Modes of operation; Basic execution environment; Floating-point unit; Intel microprocessor history.
Modes of Operation: Protected mode: native mode (Windows, Linux). Real-address mode: native MS-DOS. System management mode: power management, system security, diagnostics. Virtual-8086 mode: a hybrid of protected mode in which each program has its own 8086 computer.
Basic Execution Environment: Addressable memory; General-purpose registers; Index and base registers; Specialized register uses; Status flags; Floating-point, MMX, and XMM registers.
Addressable Memory: Protected mode: 4 GB, 32-bit address. Real-address and Virtual-8086 modes: 1 MB space, 20-bit address.
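The two sizes follow directly from the address widths: an n-bit address can name 2^n distinct bytes. A minimal check (the function name is hypothetical):

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses an address of this width can express."""
    return 2 ** address_bits


real_mode_bytes = addressable_bytes(20)       # real-address and virtual-8086 modes
protected_mode_bytes = addressable_bytes(32)  # protected mode
```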
x86 General-Purpose Registers: Named storage locations inside the CPU, optimized for speed.
Accessing Parts of Registers: Use the 8-bit name, 16-bit name, or 32-bit name. Applies to EAX, EBX, ECX, and EDX. For example, the lower 16 bits of EAX are AX, which in turn splits into AH (high byte) and AL (low byte).
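The relationship between the 32-bit, 16-bit, and 8-bit names can be modeled with shifts and masks. This sketch uses EAX with its sub-registers AX, AH, and AL; the function name is hypothetical.

```python
def register_parts(eax):
    """Return (AX, AH, AL) for a given 32-bit EAX value."""
    eax &= 0xFFFFFFFF        # keep the value within 32 bits
    ax = eax & 0xFFFF        # AX: lower 16 bits of EAX
    ah = (ax >> 8) & 0xFF    # AH: upper byte of AX
    al = ax & 0xFF           # AL: lower byte of AX
    return ax, ah, al


ax, ah, al = register_parts(0x12345678)
```

The upper 16 bits of EAX (0x1234 here) have no name of their own.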
Index and Base Registers: Some registers have only a 16-bit name for their lower half: SI (ESI), DI (EDI), BP (EBP), and SP (ESP).
Some Specialized Register Uses: General-purpose: EAX (accumulator), EBX (base register), ECX (loop counter), EDX (data register), ESP (stack pointer), ESI and EDI (index registers), EBP (extended frame pointer, stack). Segment: CS (code segment), DS (data segment), SS (stack segment), ES, FS, GS (additional segments). EIP: instruction pointer. EFLAGS: status and control flags; each flag is a single binary bit.
Status Flags: Carry: unsigned arithmetic out of range. Overflow: signed arithmetic out of range. Sign: result is negative. Zero: result is zero. Auxiliary Carry: carry from bit 3 to bit 4. Parity: the number of 1 bits in the low byte of the result is even.
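The flag definitions can be checked against a model of 8-bit addition. This is a sketch for illustration only: it covers addition of two bytes, and the function name and dictionary keys are hypothetical labels for the flags.

```python
def add8_flags(a, b):
    """Add two 8-bit values; return (result, flags) as an ADD instruction would set them."""
    a &= 0xFF
    b &= 0xFF
    total = a + b
    result = total & 0xFF
    flags = {
        "CF": total > 0xFF,                               # unsigned result out of range
        "OF": ((a ^ result) & (b ^ result) & 0x80) != 0,  # signed result out of range
        "SF": (result & 0x80) != 0,                       # result is negative
        "ZF": result == 0,                                # result is zero
        "AF": ((a & 0x0F) + (b & 0x0F)) > 0x0F,           # carry from bit 3 to bit 4
        "PF": bin(result).count("1") % 2 == 0,            # even number of 1 bits
    }
    return result, flags


signed_overflow = add8_flags(0x7F, 0x01)    # 127 + 1 does not fit in a signed byte
unsigned_overflow = add8_flags(0xFF, 0x01)  # 255 + 1 does not fit in an unsigned byte
```

The first case sets Overflow but not Carry; the second sets Carry and Zero but not Overflow.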
Intel Microprocessor History: Intel 8086, 80286; IA-32 processor family; P6 processor family; CISC and RISC.
Early Intel Microprocessors: Intel 8080: 64K addressable RAM; 8-bit registers; CP/M operating system; S-100 bus architecture; 8-inch floppy disks. Intel 8086/8088: the IBM PC used the 8088; 1 MB addressable RAM; 16-bit registers; 16-bit data bus (8-bit for the 8088); separate floating-point unit (8087).
The IBM-AT: Intel 80286; 16 MB addressable RAM; protected memory; several times faster than the 8086; introduced IDE bus architecture; 80287 floating-point unit.
Intel IA-32 Family: Intel386: 4 GB addressable RAM, 32-bit registers, paging (virtual memory). Intel486: instruction pipelining. Pentium: superscalar, 32-bit address bus, 64-bit internal data path.
Intel P6 Family: Pentium Pro: advanced optimization techniques in microcode. Pentium II: MMX (multimedia) instruction set. Pentium III: SIMD (streaming extensions) instructions. Pentium 4 and Xeon: Intel NetBurst micro-architecture, tuned for multimedia.
CISC and RISC: CISC (complex instruction set): large instruction set; high-level operations; requires a microcode interpreter; example: Intel 80x86 family. RISC (reduced instruction set): small instruction set of simple, atomic instructions; directly executed by hardware; examples: ARM (Advanced RISC Machines), DEC Alpha (now Compaq).
IA-32 Memory Management: Real-address mode; Calculating linear addresses; Protected mode; Multi-segment model; Paging.
Real-Address Mode: 1 MB of RAM maximum addressable. Application programs can access any area of memory. Single tasking. Supported by the MS-DOS operating system.
Segmented Memory: In segmented memory addressing, the absolute (linear) address is formed by combining a 16-bit segment value with a 16-bit offset. [Figure: linear addresses within one segment.]
Calculating Linear Addresses: Given a segment address, multiply it by 16 (append a hexadecimal zero) and add the offset. Example: convert 08F1:0100 to a linear address. Adjusted segment value: 08F10. Add the offset: 0100. Linear address: 09010.
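The rule above is a one-line computation: shifting the segment left by four bits is the same as multiplying by 16. A minimal sketch (the function name is hypothetical):

```python
def linear_address(segment, offset):
    """Combine a 16-bit segment and a 16-bit offset into a real-mode linear address."""
    return (segment << 4) + offset   # segment * 16 + offset


example = linear_address(0x08F1, 0x0100)
print(f"{example:05X}")
```

Formatting with `05X` prints the result as five hexadecimal digits, matching the notation used on the slide.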
Your turn . . . What linear address corresponds to the segment/offset address 028F:0030? 028F0 + 0030 = 02920. Always use hexadecimal notation for addresses.
Your turn . . . What segment addresses correspond to the linear address 28F30h? Many different segment-offset addresses can produce the linear address 28F30h. For example: 28F0:0030, 28F3:0000, 28B0:0430, . . .
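To see how many segment-offset pairs map to one linear address, the pairs can be enumerated. This sketch ignores address wraparound and considers only 16-bit segments and offsets; the function name is hypothetical.

```python
def segment_offset_pairs(linear):
    """List every (segment, offset) pair of 16-bit values that yields `linear`."""
    pairs = []
    # The offset must share the low hex digit of the linear address,
    # because segment * 16 always ends in a hexadecimal zero.
    for offset in range(linear & 0xF, 0x10000, 16):
        remainder = linear - offset
        if remainder < 0:
            break
        segment = remainder >> 4
        if segment <= 0xFFFF:
            pairs.append((segment, offset))
    return pairs


pairs = segment_offset_pairs(0x28F30)
```

For 28F30h this yields 4096 pairs, including the three listed on the slide.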
Protected Mode (1 of 2): 4 GB addressable RAM (00000000h to FFFFFFFFh). Each program is assigned a memory partition that is protected from other programs. Designed for multitasking. Supported by Linux and MS-Windows.
Protected Mode (2 of 2): Segment descriptor tables. Program structure: code, data, and stack areas; CS, DS, and SS segment descriptors; global descriptor table (GDT). MASM programs use the Microsoft flat memory model.
Components of an IA-32 Microcomputer: Motherboard; Video output; Memory; Input-output ports.
Motherboard: CPU socket; external cache memory slots; main memory slots; BIOS chips; sound synthesizer chip (optional); video controller chip (optional); IDE, parallel, serial, USB, video, keyboard, joystick, network, and mouse connectors; PCI bus connectors (expansion cards).
[Figure: Intel D850MD motherboard. Labels: dynamic RAM; Pentium 4 socket; speaker; IDE drive connectors; mouse, keyboard, parallel, serial, and USB connectors; AGP slot; battery; video; power connector; memory controller hub; diskette connector; PCI slots; I/O controller; firmware hub; audio chip. Source: Intel® Desktop Board D850MD/D850MV Technical Product Specification.]
Video Output: Video controller on the motherboard or on an expansion card. AGP (accelerated graphics port technology). Video memory (VRAM). A video CRT display uses raster scanning, with horizontal retrace and vertical retrace. Direct digital LCD monitors require no raster scanning.
Sample Video Controller (ATI Corp.): 128-bit 3D graphics performance powered by RAGE™ 128 PRO; Intelligent TV-Tuner with Digital VCR; TV-ON-DEMAND™; Interactive Program Guide; still image and MPEG-2 motion video capture; video editing; hardware DVD video playback; video output to TV or VCR.
Memory: ROM: read-only memory. EPROM: erasable programmable read-only memory. Dynamic RAM (DRAM): inexpensive; must be refreshed constantly. Static RAM (SRAM): expensive; used for cache memory; no refresh required. Video RAM (VRAM): dual ported; optimized for constant video refresh. CMOS RAM: complementary metal-oxide semiconductor; holds system setup information. See: Intel platform memory (Intel technology brief).
Input-Output Ports: USB (universal serial bus): intelligent high-speed connection to devices; up to 12 megabits/second; a USB hub connects multiple devices; enumeration: the computer queries devices; supports hot connections. Parallel: short cable, high speed; common for printers; bidirectional, parallel data transfer; Intel 8255 controller chip.
Input-Output Ports (cont): Serial: RS-232 serial port; one bit at a time; uses long cables and modems; 16550 UART (universal asynchronous receiver transmitter); programmable in assembly language.
Levels of Input-Output: Level 3: call a library function (C++, Java); easy to do; abstracted from hardware; details hidden; slowest performance. Level 2: call an operating system function; specific to one OS; device-independent; medium performance. Level 1: call a BIOS (basic input-output system) function; may produce different results on different systems; knowledge of hardware required; usually good performance. Level 0: communicate directly with the hardware; may not be allowed by some operating systems.
Displaying a String of Characters: When a HLL program displays a string of characters, the following steps take place:
ASM Programming Levels: ASM programs can perform input-output at each of the following levels:
Summary: Central Processing Unit (CPU); Arithmetic Logic Unit (ALU); Instruction execution cycle; Multitasking; Floating Point Unit (FPU); Complex Instruction Set; Real mode and Protected mode; Motherboard components; Memory types; Input/Output and access levels.
More Details about the x86 Family Architecture: x86 family generations.
x86 Family: 8086 and 8088 Microprocessors (80x86 architecture). Address bus: 20 bits (16 bits for the 8-bit chips), for a maximum memory capacity of 1 MB. The internal structure is divided into a bus interface unit (BIU) and an execution unit (EU), so instruction fetch and execution can occur simultaneously. Internal registers are widened from 8 bits to 16 bits. Hardware multiply and divide instructions are built into the processor. Support for an external math coprocessor performs floating-point operations in hardware, as much as 100 times faster.
Intel 8085 architecture: 8-bit data, 16-bit address.
Internal architecture of the 8086.
PC Standard: A 16-bit data bus requires two 8-bit memory banks, which was expensive at the time. In 1979, Intel announced the 8088 µP, which is identical to the 8086 except for its external 8-bit data bus, so two memory accesses are needed to input a word. IBM announced the IBM PC, using the 8088 µP and 16 KB of memory (expandable to 64 KB), with a clock speed of 4.77 MHz. With it, the PC standard was defined.
80186 and 80188 Microprocessors: High-integration CPUs that include an 8086 (or 8088) core plus a clock generator, a programmable timer, an interrupt controller, a DMA controller, etc. The instruction set is fully compatible with the 8086 and 8088 but includes 9 new instructions. Used for IBM-PC compatibles and many embedded computers.
80286 Microprocessor: Processor of the IBM PC-AT. Provides two programming modes. Real mode: functions exactly the same as the 8086; uses only the 20 least significant address lines (max. 1 MB); faster than the 8086 due to redesign and a higher clock. Protected mode: 16 new instructions are added; supports a multi-program environment by giving each program a predetermined amount of memory (16 MB); programs no longer have physical addresses but are addressed by a segment selector; several programs can be loaded into memory at the same time but are protected from each other (*MS-DOS).
The 8086 and 80286 microprocessors.
80386 Microprocessor: New standard announced by Intel (1985), with a commitment that successive µP generations would remain compatible with this chip: Intel Architecture-32 (IA-32), through 2000. Data bus and internal registers: 32 bits. Address bus: 32 bits, for a maximum of 4 GB of physical memory.
Internal architecture of the 80386.
Internal registers (partly) of the 80386.
The 80386 supports two operating modes (like the 80286). 1) Real Address Mode: used by MS-DOS; in this mode, the 80386 becomes a fast 8086. 2) Protected Virtual Address Mode (Protected Mode): an on-board MMU manages 4 GB of memory; each task is given a segment of memory governed by a descriptor register that defines the segment base address, the segment limit, and the attributes of the segment (code, data, read-only, etc.); paging: 4 KB pages can be swapped in and out of memory (using a disk) to give a task a virtual memory space as large as 64 TB.
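With 4 KB pages, a 32-bit linear address splits into a page number and a 12-bit offset within the page. The sketch below shows that split, plus one common way to arrive at the 64 TB figure: 16,384 segments of up to 4 GB each. That derivation is an assumption added here for illustration; the slide states only the total. All names are hypothetical.

```python
PAGE_SIZE = 4 * 1024  # 4 KB pages, as stated on the slide


def split_linear_address(address):
    """Return (page_number, offset_within_page) for a 32-bit linear address."""
    address &= 0xFFFFFFFF
    return address // PAGE_SIZE, address % PAGE_SIZE


page, offset = split_linear_address(0x00403A10)

# Assumption: 2**14 segments, each up to 2**32 bytes (4 GB).
virtual_bytes = 2**14 * 2**32
virtual_terabytes = virtual_bytes // 2**40
```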
When operating with 64 KB of cache, the 386 achieves a hit rate of 93%, so the processor operates at full speed 93% of the time. The instruction set of the 386 is 100% compatible with the older processors in the family. 14 new instructions are added and several others are modified. [ex] data can be moved between the internal registers at a time. 80386SX: designed to ease the transition from 16- to 32-bit processors, with a 16-bit external data bus and a 24-bit address bus.
80486 Microprocessor: Maintains compatibility with the older µPs. Only 6 new instructions are added, for use by OS software rather than application programs. Redesigned using RISC concepts, so frequently used instructions execute in a single clock cycle. New 5-stage instruction execution pipeline: up to 5 instructions can be in progress at once, each in a different stage. On-board 8K cache and 80387 coprocessor: twice as fast as the 386 (20 MHz 387 = 40 MHz 386). 486SX: excludes the 80387; designed for low-end applications that do not require a coprocessor.
486DX2 and DX4: DX2: the internal clock rate is twice the external clock. DX4: the internal clock rate is three times the external clock. This allows less expensive components on the computer system board while the processor operates at its maximum data rate internally. [Ex] 486DX2 66: 66 MHz (internal clock) and 33 MHz (external clock); 486DX4 100: 100 MHz (internal clock) and 33 MHz (external clock). OverDrive processors: 486 system boards include an OverDrive socket that lets users upgrade a low-speed 486DX or 486SX with 486DX2- and DX4-style processors.
Pentium: Superscalar architecture: provides two instruction execution pipelines, each with its own ALU, address generation circuitry, and data cache interface, so two different instructions can execute simultaneously. Additional features: on-board cache (separate 8K instruction and data caches) and a coprocessor; 8-stage instruction pipelines; 5 to 8 times the floating-point performance of the 486; 64-bit external data bus; about twice as fast as the 486.
Key features of the Pentium microprocessor. The execution unit has two pipelines, allowing two instructions to be executed simultaneously.
MMX (Multimedia Extension): provides 3 architectural enhancements over the non-MMX Pentium. 57 instructions are added for multimedia (audio, video, and graphic data) applications. SIMD (single-instruction stream, multiple-data stream) allows the same operation to be performed on multiple data items; because many multimedia applications manipulate large blocks of data, SIMD provides a significant performance enhancement. Internal cache size is increased from 16K to 32K. Performance improves by 10 to 20% for general applications and by nearly 70% for multimedia applications.
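The SIMD idea, one operation applied to several data items at once, can be modeled in software. The sketch below mimics a packed byte addition with 8-bit wraparound, in the spirit of the MMX `PADDB` instruction operating on eight bytes; the Python function is a hypothetical model, not how the hardware is programmed.

```python
def packed_add_bytes(a, b):
    """Add two equal-length byte sequences lane by lane, wrapping each lane at 8 bits."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [(x + y) & 0xFF for x, y in zip(a, b)]


# Eight byte lanes, as would fit in one 64-bit MMX register.
lanes = packed_add_bytes(
    [1, 2, 250, 255, 0, 10, 20, 30],
    [1, 2, 10, 1, 0, 5, 5, 5],
)
```

Each lane is independent: the overflow in the third and fourth lanes wraps around without affecting their neighbors.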
Socket 7: a ZIF (zero insertion force) socket. The Pentium chip comes in a 296-pin PGA package. A heat sink and fan are mounted atop the chip, and the entire assembly plugs into a ZIF socket, called Socket 7. Socket 7 defines a platform that specifies the front-side bus connection to the L2 cache, disk interface, video interface, and the ISA and PCI expansion buses.
Pentium processor with heat sink and fan mated to a Socket 7 connector.
Pentium Pro: 6th-generation processors (Pentium Pro, Pentium II, Pentium III, and Celeron). 36 address lines, for a maximum of 64 GB of memory. New features: 1. Inclusion of the L2 cache in the same package as the processor. 2. New system board platforms called Socket 8 (Pro), Slot 1 and Slot 2 (Pentium II, III, and Celeron), and Socket 370 (Pentium III and Celeron). 3. New instruction architecture based on Dynamic Execution. Two chips in one package: the Pentium Pro consists of two separate silicon dies, one for the processor and the other for the 256 KB L2 cache.
The Pentium Pro is two chips in one. The larger die is the processor, the smaller a 256K L2 cache. (Courtesy of Intel Corporation.)
Dynamic Execution: a new approach to processing software instructions that reduces idle processor time. Multiple branch prediction: the Pentium Pro can look as far as 30 instructions ahead to anticipate conditional branches, which reduces wasted pipeline clocks. Data flow analysis: examines upcoming software instructions to find the optimal sequence of processing. Speculative execution: allows instructions to execute in a different order from that in which they entered the processor ("out-of-order execution"). The results of these instructions are stored as speculative results until their final states can be determined.
Superscalar processor of degree three: the Pentium Pro has three instruction decoders and can execute 3 instructions simultaneously. Internal cache: L2 cache in the same package.
Pentium II: The Pentium Pro had a short life, due to the lack of MMX instructions and the use of an expensive dual- and tri-cavity package. The Pentium II is a Pentium Pro with MMX technology, repackaged in a new single-edge contact (SEC) cartridge that is inserted into a "Slot 1 connector (242 pins)" or a "Slot 2 connector (330 pins)". The processor and L2 cache are mounted on a ceramic substrate (the silicon dies are separate). Processor clock: 300 to 450 MHz; bus clock: 100 MHz. L1 (32 KB) and L2 (512 KB), with a 64-bit dedicated bus.
Exploded view of single-edge contact (SEC) cartridge. (Courtesy of Intel Corporation.)
Installing the SEC cartridge into the retention mechanism. (Courtesy of Intel Corporation.)
Celeron: A Pentium II without L2 cache (Pentium II SX?). Uses the Slot 1 connector without the plastic cover, the so-called "naked CPU". Celeron A: includes a 128 KB L2 cache on the same die as the processor. Drawback: 66 MHz bus cycle. 370-pin PGA package (called Socket 370).
The Celeron processor is a Pentium II without the L2 cache. Later versions, called the Celeron A, include this cache on the same silicon die with the processor. (Courtesy of Intel Corporation.)
Pentium III: Higher clock speed: based on the Pentium II core, with a 600 MHz clock and an external bus frequency of 133 MHz. 70 new streaming SIMD extensions (SSE): 50 to improve floating-point performance, 12 to improve multimedia processing, and 8 to improve the efficiency of the L1 cache.
The Pentium III microprocessor with integrated L2 cache. This chip has more than 22 million transistors. (Courtesy of Intel Corporation.)
Xeon Processors: Scalability: as processing demands increase, additional processors can be interconnected to keep pace. One advantage of the Pentium Pro is that it can support up to 4 processors with SMP (symmetric multiprocessing). The Pentium II Xeon processor can be scaled to 2, 4, 8, or more processors and is used for high-end servers and workstations. The Pentium III Xeon processor is similar but offers the streaming SIMD technology.
P7 Itanium: IA-64: 7th-generation processor architecture, code name Merced. 64-bit architecture: 128 64-bit registers and 128 82-bit floating-point registers (including hidden bits) [cf. IA-32: 10 32-bit registers, 8 floating-point registers]. Explicit parallelism: instructions are packed in 128-bit bundles ready for execution. Each bundle consists of three 41-bit instructions and a 5-bit template. All three instructions are dispatched in parallel.
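The bundle arithmetic (three 41-bit instructions plus a 5-bit template giving 128 bits) can be checked with a small pack/unpack sketch. The field order used here, template in the lowest bits with the three slots above it, is an assumption for illustration, and the function names are hypothetical.

```python
SLOT_BITS = 41       # width of one instruction slot, from the slide
TEMPLATE_BITS = 5    # width of the template field, from the slide
BUNDLE_BITS = TEMPLATE_BITS + 3 * SLOT_BITS


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one integer.

    Assumed layout: template in the lowest 5 bits, then slots 0..2 above it.
    """
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, slot in enumerate(slots):
        shift = TEMPLATE_BITS + i * SLOT_BITS
        bundle |= (slot & ((1 << SLOT_BITS) - 1)) << shift
    return bundle


def unpack_bundle(bundle):
    """Split a packed bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    ]
    return template, slots


bundle = pack_bundle(0b10000, [0x1, 0x2, (1 << SLOT_BITS) - 1])
```

Unpacking `bundle` returns the same template and slot values, and the packed value never needs more than 128 bits.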
Speculation: preloads data to minimize memory delays when the data is needed. Predication: when a conditional branch instruction is encountered, Itanium follows both branch paths, then commits the results of the correct path only. Data bus: 128 bits. Address bus: 64 bits, for a maximum of 2^64 bytes of memory.
80x86-Compatible Microprocessors: Second sources: manufacture 80x86 µP chips under license from Intel. Clones and look-alikes: pin-for-pin replacements with all of the same features as the Intel processor. [Ex] AMD 386DX, 486DX4-100, Cyrix 5x86, etc.
The AMD K7 or Athlon processor. It mates to a new proprietary socket called Slot A. (Courtesy of Advanced Micro Devices.)
Measuring Processor Performance: Benchmark programs: used to measure the performance of a computer system (system benchmarks) or of a component in that system, such as the processor, disk, video card, or main memory (component benchmarks). Component-level benchmarks: Whetstone: measures the time to execute integer and floating-point arithmetic instructions and "if" statements; it includes a high percentage of floating-point operations and is mostly used to represent numerical programs.
Dhrystone: a synthetic benchmark consisting of 12 procedures with 94 statements and no floating-point operations. Microprocessor benchmarks: developed for comparing the processing ability of the various µP chips, such as Ziff-Davis' CPUmark and Intel's iCOMP index. CPUmark: measures the speed of a PC's processor subsystem, including the CPU, its internal and external caches, and system RAM. [Ex] Fig. 1-20: CPUmark99 ratings for 80x86s. iCOMP: combines 4 industry-standard benchmarks: CPUmark32, Norton SI32, SPEC95, and the Intel Media Benchmark (audio, video, image, 3-D, etc.).
CPUmark is a benchmark that measures the speed of the processor and its internal cache.
System-level Benchmarks: Microcomputer benchmarks: measure the speed of the processor while taking a slow disk or video subsystem into account. Winstone: a system-level, application-based benchmark that measures a PC's overall performance when running today's 32-bit applications on Windows 95, 98, and NT. [Ex] Winstone 98 ratings for 80x86s. Performance rating: Cyrix and AMD developed the P-rating (Processor Performance) system, which runs applications on a processor and compares the results to a Pentium µP. [Ex] Table 1-2: PR166 to 366 for AMD and Cyrix chips.
Winstone 98 measures the performance of a PC system running typical Windows applications.

The x86 Family

  • 1.
    Assembly Language x86Family Architecture Motaz K. Saad Spring 2007 Motaz K. Saad, Dept. of CS
  • 2.
    Overview General ConceptsIA-32 Processor Architecture IA-32 Memory Management Components of an IA-32 Microcomputer Input-Output System Motaz K. Saad, Dept. of CS
  • 3.
    General Concepts Basicmicrocomputer design Instruction execution cycle Reading from memory How programs run Motaz K. Saad, Dept. of CS
  • 4.
    Basic Microcomputer DesignClock synchronizes CPU operations Control unit (CU) coordinates sequence of execution steps ALU performs arithmetic and bitwise processing Motaz K. Saad, Dept. of CS
  • 5.
    Motaz K. Saad,Dept. of CS Processor Control Unit Arithmetic Logic Unit (ALU) Arithmetic Logic Unit (ALU) Input Devices Storage Devices Output Devices Memory Data Information Instructions Data Information Instructions Data Information Control Unit
  • 6.
    Motaz K. Saad,Dept. of CS
  • 7.
    Clock Synchronizes allCPU and BUS operations Machine (clock) cycle measures time of a single operation Clock is used to trigger events Motaz K. Saad, Dept. of CS
  • 8.
    What's Next GeneralConcepts IA-32 Processor Architecture IA-32 Memory Management Components of an IA-32 Microcomputer Input-Output System Motaz K. Saad, Dept. of CS
  • 9.
    Instruction Execution CycleFetch Decode Fetch operands Execute Store output Motaz K. Saad, Dept. of CS
  • 10.
    Cache Memory High-speedexpensive static RAM both inside and outside the CPU. Level-1 cache: inside the CPU Level-2 cache: outside the CPU Cache hit: when data to be read is already in cache memory Cache miss: when data to be read is not in cache memory. Motaz K. Saad, Dept. of CS
  • 11.
    How a ProgramRuns Motaz K. Saad, Dept. of CS
  • 12.
    Multitasking OS canrun multiple programs at the same time. Multiple threads of execution within the same program. Scheduler utility assigns a given amount of CPU time to each running program. Rapid switching of tasks gives illusion that all programs are running at once the processor must support task switching. Motaz K. Saad, Dept. of CS
  • 13.
    IA-32 Processor ArchitectureModes of operation Basic execution environment Floating-point unit Intel Microprocessor history Motaz K. Saad, Dept. of CS
  • 14.
    Modes of OperationProtected mode native mode (Windows, Linux) Real-address mode native MS-DOS System management mode power management, system security, diagnostics Virtual-8086 mode hybrid of Protected each program has its own 8086 computer Motaz K. Saad, Dept. of CS
  • 15.
    Basic Execution EnvironmentAddressable memory General-purpose registers Index and base registers Specialized register uses Status flags Floating-point, MMX, XMM registers Motaz K. Saad, Dept. of CS
  • 16.
    Addressable Memory Protectedmode 4 GB 32-bit address Real-address and Virtual-8086 modes 1 MB space 20-bit address Motaz K. Saad, Dept. of CS
  • 17.
    X86 General-Purpose RegistersNamed storage locations inside the CPU, optimized for speed. Motaz K. Saad, Dept. of CS
  • 18.
    Accessing Parts ofRegisters Use 8-bit name, 16-bit name, or 32-bit name Applies to EAX, EBX, ECX, and EDX Motaz K. Saad, Dept. of CS
  • 19.
    Index and BaseRegisters Some registers have only a 16-bit name for their lower half: Motaz K. Saad, Dept. of CS
  • 20.
    Some Specialized RegisterUses Segment CS – code segment DS – data segment SS – stack segment ES, FS, GS - additional segments EIP – instruction pointer EFLAGS status and control flags each flag is a single binary bit General-Purpose EAX – accumulator EBX – base register ECX – loop counter EDX – data register ESP – stack pointer ESI, EDI – index registers EBP – extended frame pointer (stack) Motaz K. Saad, Dept. of CS
  • 21.
    Status Flags Carryunsigned arithmetic out of range Overflow signed arithmetic out of range Sign result is negative Zero result is zero Auxiliary Carry carry from bit 3 to bit 4 Parity sum of 1 bits is an even number Motaz K. Saad, Dept. of CS
  • 22.
    Intel Microprocessor HistoryIntel 8086, 80286 IA-32 processor family P6 processor family CISC and RISC Motaz K. Saad, Dept. of CS
  • 23.
    Early Intel MicroprocessorsIntel 8080 64K addressable RAM 8-bit registers CP/M operating system S-100 BUS architecture 8-inch floppy disks! Intel 8086/8088 IBM-PC Used 8088 1 MB addressable RAM 16-bit registers 16-bit data bus (8-bit for 8088) separate floating-point unit (8087) Motaz K. Saad, Dept. of CS
  • 24.
    The IBM-AT Intel80286 16 MB addressable RAM Protected memory several times faster than 8086 introduced IDE bus architecture 80287 floating point unit Motaz K. Saad, Dept. of CS
  • 25.
    Intel IA-32 FamilyIntel386 4 GB addressable RAM, 32-bit registers, paging (virtual memory) Intel486 instruction pipelining Pentium superscalar, 32-bit address bus, 64-bit internal data path Motaz K. Saad, Dept. of CS
  • 26.
    Intel P6 FamilyPentium Pro advanced optimization techniques in microcode Pentium II MMX (multimedia) instruction set Pentium III SIMD (streaming extensions) instructions Pentium 4 and Xeon Intel NetBurst micro-architecture, tuned for multimedia Motaz K. Saad, Dept. of CS
  • 27.
    CISC and RISCCISC – complex instruction set large instruction set high-level operations requires microcode interpreter examples: Intel 80x86 family RISC – reduced instruction set simple, atomic instructions small instruction set directly executed by hardware examples: ARM (Advanced RISC Machines) DEC Alpha (now Compaq) Motaz K. Saad, Dept. of CS
  • 28.
    What's Next GeneralConcepts IA-32 Processor Architecture IA-32 Memory Management Components of an IA-32 Microcomputer Input-Output System Motaz K. Saad, Dept. of CS
  • 29.
    IA-32 Memory ManagementReal-address mode Calculating linear addresses Protected mode Multi-segment model Paging Motaz K. Saad, Dept. of CS
  • 30.
    Real-Address mode 1MB RAM maximum addressable Application programs can access any area of memory Single tasking Supported by MS-DOS operating system Motaz K. Saad, Dept. of CS
  • 31.
    Segmented Memory Segmentedmemory addressing: absolute (linear) address is a combination of a 16-bit segment value added to a 16-bit offset linear addresses one segment Motaz K. Saad, Dept. of CS
  • 32.
    Calculating Linear AddressesGiven a segment address, multiply it by 16 (add a hexadecimal zero), and add it to the offset Example: convert 08F1:0100 to a linear address Adjusted Segment value: 0 8 F 1 0 Add the offset: 0 1 0 0 Linear address: 0 9 0 1 0 Motaz K. Saad, Dept. of CS
  • 33.
    Your turn .. . What linear address corresponds to the segment/offset address 028F:0030? 028F0 + 0030 = 02920 Always use hexadecimal notation for addresses. Motaz K. Saad, Dept. of CS
  • 34.
    Your turn .. . What segment addresses correspond to the linear address 28F30h? Many different segment-offset addresses can produce the linear address 28F30h. For example: 28F0:0030, 28F3:0000, 28B0:0430, . . . Motaz K. Saad, Dept. of CS
  • 35.
    Protected Mode (1 of 2) 4 GB addressable RAM (00000000 to FFFFFFFFh) Each program assigned a memory partition which is protected from other programs Designed for multitasking Supported by Linux & MS-Windows Motaz K. Saad, Dept. of CS
  • 36.
    Protected mode (2 of 2) Segment descriptor tables Program structure code, data, and stack areas CS, DS, SS segment descriptors global descriptor table (GDT) MASM Programs use the Microsoft flat memory model Motaz K. Saad, Dept. of CS
  • 38.
    Components of an IA-32 Microcomputer
    - Motherboard
    - Video output
    - Memory
    - Input-output ports
  • 39.
    Motherboard
    - CPU socket
    - External cache memory slots
    - Main memory slots
    - BIOS chips
    - Sound synthesizer chip (optional)
    - Video controller chip (optional)
    - IDE, parallel, serial, USB, video, keyboard, joystick, network, and mouse connectors
    - PCI bus connectors (expansion cards)
  • 40.
    Intel D850MD Motherboard
    - [Figure: board photo with labeled components] dynamic RAM; Pentium 4 socket; speaker; IDE drive connectors; mouse, keyboard, parallel, serial, and USB connectors; AGP slot; battery; video; power connector; memory controller hub; diskette connector; PCI slots; I/O controller; firmware hub; audio chip
    - Source: Intel® Desktop Board D850MD/D850MV Technical Product Specification
  • 41.
    Video Output
    - Video controller
      - on the motherboard, or on an expansion card
      - AGP (accelerated graphics port technology)
    - Video memory (VRAM)
    - Video CRT display
      - uses raster scanning
      - horizontal retrace
      - vertical retrace
    - Direct digital LCD monitors
      - no raster scanning required
  • 42.
    Sample Video Controller (ATI Corp.)
    - 128-bit 3D graphics performance, powered by RAGE™ 128 PRO
    - Intelligent TV-Tuner with Digital VCR
    - TV-ON-DEMAND™
    - Interactive Program Guide
    - Still image and MPEG-2 motion video capture
    - Video editing
    - Hardware DVD video playback
    - Video output to TV or VCR
  • 43.
    Memory
    - ROM: read-only memory
    - EPROM: erasable programmable read-only memory
    - Dynamic RAM (DRAM): inexpensive; must be refreshed constantly
    - Static RAM (SRAM): expensive; used for cache memory; no refresh required
    - Video RAM (VRAM): dual ported; optimized for constant video refresh
    - CMOS RAM: complementary metal-oxide semiconductor; holds system setup information
    - See: Intel platform memory (Intel technology brief)
  • 44.
    Input-Output Ports
    - USB (universal serial bus)
      - intelligent high-speed connection to devices
      - up to 12 megabits/second
      - USB hub connects multiple devices
      - enumeration: computer queries devices
      - supports hot connections
    - Parallel
      - short cable, high speed
      - common for printers
      - bidirectional, parallel data transfer
      - Intel 8255 controller chip
  • 45.
    Input-Output Ports (cont.)
    - Serial
      - RS-232 serial port
      - one bit at a time
      - uses long cables and modems
      - 16550 UART (universal asynchronous receiver transmitter)
      - programmable in assembly language
  • 47.
    Levels of Input-Output
    - Level 3: Call a library function (C++, Java)
      - easy to do; abstracted from hardware; details hidden
      - slowest performance
    - Level 2: Call an operating system function
      - specific to one OS; device-independent
      - medium performance
    - Level 1: Call a BIOS (basic input-output system) function
      - may produce different results on different systems
      - knowledge of hardware required
      - usually good performance
    - Level 0: Communicate directly with the hardware
      - may not be allowed by some operating systems
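The difference between Level 3 and Level 2 can be illustrated from a high-level language. The sketch below is an assumption-laden example in Python rather than assembly: the function names are hypothetical, file descriptor 1 is assumed to be standard output, and Levels 1 and 0 are not shown because a protected-mode operating system normally blocks them for ordinary programs.

```python
import os
import sys


def show_via_library(text: str) -> None:
    # Level 3: a high-level library call; buffering and encoding are hidden.
    print(text, file=sys.stdout)


def show_via_os(text: str, fd: int = 1) -> int:
    # Level 2: ask the operating system to write raw bytes to a file
    # descriptor. Returns the number of bytes actually written.
    return os.write(fd, text.encode("ascii"))


if __name__ == "__main__":
    show_via_library("hello from the library level")
    sys.stdout.flush()  # empty the library's buffer before bypassing it
    show_via_os("hello from the OS level\n")
```

The explicit flush is needed because the library level buffers output, while the OS-level call writes immediately; mixing the two without it can reorder the text.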
  • 48.
    Displaying a String of Characters
    - When a high-level language (HLL) program displays a string of characters, the following steps take place:
  • 49.
    ASM Programming Levels
    - ASM programs can perform input-output at each of the following levels:
  • 50.
    Summary
    - Central Processing Unit (CPU)
    - Arithmetic Logic Unit (ALU)
    - Instruction execution cycle
    - Multitasking
    - Floating Point Unit (FPU)
    - Complex Instruction Set
    - Real mode and Protected mode
    - Motherboard components
    - Memory types
    - Input/Output and access levels
  • 51.
    More Details about the x86 Family Architecture
    - x86 family generations
  • 52.
    x86 Family: 8086 and 8088 Microprocessors
    - 80x86 architecture
    - Address bus: 20 bits (16 bits on the earlier 8-bit chips) → max. memory capacity: 1 MB
    - Internal structure is divided into a BIU (bus interface unit) and an EU (execution unit) → instruction fetch and execution can occur simultaneously
    - Internal registers widened from 8 bits to 16 bits
    - Hardware multiply and divide instructions built into the processor
    - Support for an external math coprocessor, which performs floating-point operations in hardware as much as 100 times faster
  • 53.
    Intel 8085 architecture: 8-bit data, 16-bit address
  • 54.
    Internal architecture of 8086
  • 55.
    PC Standard
    - A 16-bit data bus requires two 8-bit memory banks → expensive at the time
    - In 1979, Intel announced the 8088 microprocessor, identical to the 8086 except for its external 8-bit data bus → two memory accesses are needed to read a word
    - IBM announced the IBM PC, using the 8088 with 16 KB of memory (expandable to 64 KB)
    - Clock speed: 4.77 MHz
    - With this machine, the PC standard was defined
  • 56.
    80186 and 80188 Microprocessors
    - High-integration CPUs: include an 8086 (or 8088) core plus a clock generator, a programmable timer, an interrupt controller, a DMA controller, etc.
    - Instruction set is fully compatible with the 8086 and 8088, but includes 9 new instructions
    - Used in IBM PC compatibles and many embedded computers
  • 57.
    80286 Microprocessor
    - Processor of the IBM PC-AT
    - Provides two programming modes
    - Real mode
      - functions exactly like the 8086
      - uses only the 20 least significant address lines (max. 1 MB)
      - faster than the 8086 thanks to a redesign and a higher clock rate
    - Protected mode
      - 16 new instructions are added
      - supports a multi-program environment by giving each program a predetermined amount of memory (16 MB)
      - programs no longer use physical addresses; memory is addressed through a segment selector
      - several programs can be loaded into memory at the same time, but are protected from each other (*MS-DOS)
  • 58.
    The 8086 and 80286 microprocessors.
  • 59.
    80386 Microprocessor
    - New standard announced by Intel in 1985, with a commitment that successive microprocessor generations would remain compatible with this chip through 2000: Intel Architecture-32 (IA-32)
    - Data bus and internal registers: 32 bits
    - Address bus: 32 bits → max. 4 GB of physical memory
  • 60.
    Internal architecture of 80386
  • 61.
    Internal registers (partly) of 80386
  • 62.
    80386 supports two operating modes (like the 80286)
    - 1) Real Address Mode
      - used by MS-DOS
      - in this mode, the 80386 behaves as a fast 8086
    - 2) Protected Virtual Address Mode (Protected Mode)
      - on-board MMU manages 4 GB of memory
      - each task is given a segment of memory governed by a descriptor register, which defines the segment base address, the segment limit, and the segment attributes (code, data, read-only, etc.)
      - uses paging: 4 KB pages can be swapped in and out of memory (using a disk), giving a task a virtual memory space as large as 64 TB
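The slide gives the 64 TB figure without a derivation. One way to arrive at it is sketched below, assuming the usual IA-32 selector layout of a 13-bit descriptor index plus one table-indicator bit; that layout is not stated on the slide, so treat the constants as assumptions.

```python
SELECTOR_INDEX_BITS = 13      # assumption: descriptor index field of a selector
TABLE_INDICATOR_BITS = 1      # assumption: one bit choosing between two tables
SEGMENT_SIZE_BYTES = 2 ** 32  # each segment can span up to 4 GB

# Number of distinct segments a task can name, times the size of each one.
segments = 2 ** (SELECTOR_INDEX_BITS + TABLE_INDICATOR_BITS)
virtual_bytes = segments * SEGMENT_SIZE_BYTES

print(segments, "segments x 4 GB =", virtual_bytes // 2 ** 40, "TB")
```

Under these assumptions the virtual space is far larger than the 4 GB of physical memory, which is why pages have to be swapped to and from disk.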
  • 63.
    - With 64 KB of cache, the 386 achieves a hit rate of 93% → the processor operates at full speed 93% of the time
    - The 386 instruction set is 100% compatible with the older processors in the family
    - 14 new instructions are added and several others are modified. [Ex] data can be moved between the internal registers at a time.
    - 80386SX: designed to ease the transition from 16-bit to 32-bit processors; 16-bit external data bus and 24-bit address bus
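The effect of the 93% hit rate on memory access time can be estimated with a weighted average. In the sketch below only the hit rate comes from the slide; the function name and both cycle counts are made-up illustration values, not measured 386 figures.

```python
def average_access_cycles(hit_rate: float, hit_cycles: float, miss_cycles: float) -> float:
    """Weighted average of cache-hit and cache-miss access times."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# 93% is the slide's figure; 2 and 6 cycles are hypothetical latencies.
avg = average_access_cycles(hit_rate=0.93, hit_cycles=2, miss_cycles=6)
print(f"average access time: {avg:.2f} cycles")
```

Even a small miss rate pulls the average above the cache latency, which is why the hit rate is quoted as a headline number.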
  • 64.
    80486 Microprocessor
    - Maintains compatibility with the older microprocessors
    - Only 6 new instructions are added, intended for OS software rather than application programs
    - Redesigned using RISC concepts → frequently used instructions execute in a single clock cycle
    - New 5-stage instruction execution pipeline → up to 5 instructions can be in progress at once
    - On-board 8 KB cache and 80387 coprocessor → about twice as fast as the 386 (20 MHz 486 ≈ 40 MHz 386)
    - 486SX: excludes the 80387; designed for low-end applications that do not require a coprocessor
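The benefit of the 5-stage pipeline can be seen by counting clock cycles. This is an idealized sketch: it assumes one cycle per stage and no stalls, which real code does not achieve, and the function names are hypothetical.

```python
def pipelined_cycles(instructions: int, stages: int = 5) -> int:
    """Ideal pipeline: fill it once, then finish one instruction per cycle."""
    if instructions <= 0:
        return 0
    return stages + instructions - 1


def unpipelined_cycles(instructions: int, stages: int = 5) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return stages * max(instructions, 0)


# Columns: instruction count, cycles without pipelining, cycles with it.
for n in (1, 5, 100):
    print(n, unpipelined_cycles(n), pipelined_cycles(n))
```

For long instruction streams the pipelined count approaches one cycle per instruction, matching the slide's point about single-cycle execution of common instructions.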
  • 65.
    486DX2 and DX4
    - DX2: the internal clock rate is twice the external clock
    - DX4: the internal clock rate is three times the external clock
    - → Allows less expensive components on the system board while the processor runs internally at its maximum data rate
    - [Ex] 486DX2 66: 66 MHz (int. clock) and 33 MHz (ext. clock)
    - [Ex] 486DX4 100: 100 MHz (int. clock) and 33 MHz (ext. clock)
    - OverDrive processors: 486 system boards include an OverDrive socket that lets users upgrade a low-speed 486DX or 486SX with 486DX2- and DX4-style processors
  • 66.
    Pentium
    - Superscalar architecture: provides two instruction execution pipelines, each with its own ALU, address generation circuitry, and data cache interface → executes two different instructions simultaneously
    - Additional features
      - on-board cache (separate 8 KB instruction and data caches) and a coprocessor
      - 8-stage instruction pipeline achieves 5 to 8 times the floating-point performance of the 486
      - external data bus: 64 bits
      - about twice as fast as the 486
  • 67.
    Key features of the Pentium microprocessor. The execution unit has two pipelines, allowing two instructions to be executed simultaneously.
  • 68.
    MMX (Multimedia Extension): provides 3 architectural enhancements over the non-MMX Pentium
    - 57 instructions are added for multimedia (audio, video, and graphics) applications
    - SIMD (single-instruction stream, multiple-data stream) allows the same operation to be performed on multiple data items. Because many multimedia applications manipulate large blocks of data, SIMD provides a significant performance gain.
    - Internal cache size is increased from 16 KB to 32 KB
    - Performance improves by 10 to 20% for general applications and by nearly 70% for multimedia applications
  • 69.
    Socket 7: ZIF (zero insertion force) socket
    - Pentium chip: 296-pin PGA package. A heat sink and fan are mounted atop the chip, and the entire assembly plugs into a ZIF socket known as Socket 7.
    - Socket 7 also names a platform that defines the front-side bus connection to the L2 cache, disk interface, video interface, and the ISA and PCI expansion buses
  • 70.
    Pentium processor with heat sink and fan mated to a Socket 7 connector.
  • 71.
    Pentium Pro
    - 6th-generation processors (Pentium Pro, Pentium II, Pentium III, and Celeron)
    - 36 address lines → max. 64 GB of memory
    - New features
      - 1. L2 cache included in the same package as the processor
      - 2. New system board platforms: Socket 8 (Pro), Slot 1 and Slot 2 (Pentium II, III, and Celeron), and Socket 370 (Pentium III and Celeron)
      - 3. New instruction architecture based on Dynamic Execution
    - Two chips in one package: the Pentium Pro consists of two separate silicon dies, one for the processor and the other for the 256 KB L2 cache
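The link between address lines and maximum memory is a power of two, as the short sketch below shows for the bus widths quoted in this deck. The helper names are illustrative, and the 24-line entry for the 80286 is inferred from its 16 MB limit rather than stated on its slide.

```python
def max_addressable_bytes(address_lines: int) -> int:
    """Each extra address line doubles the number of distinct byte addresses."""
    return 2 ** address_lines


def human(n: int) -> str:
    """Format a byte count using binary units (1 KB = 1024 bytes)."""
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n} {unit}"
        n //= 1024


for name, lines in [("8086", 20), ("80286", 24), ("80386", 32), ("Pentium Pro", 36)]:
    print(f"{name}: {lines} lines -> {human(max_addressable_bytes(lines))}")
```

Going from 32 to 36 lines adds only four wires but multiplies the reachable memory by sixteen.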
  • 72.
    The Pentium Pro is two chips in one. The larger die is the processor; the smaller is a 256 KB L2 cache. (Courtesy of Intel Corporation.)
  • 73.
    Dynamic Execution: a new approach to processing software instructions that reduces idle processor time
    - Multiple branch prediction: the Pentium Pro can look as far as 30 instructions ahead to anticipate conditional branches → fewer wasted pipeline clocks
    - Data flow analysis: looks at upcoming software instructions to find the optimal processing sequence
    - Speculative execution: allows instructions to be executed in a different order from the one in which they entered the processor ("out-of-order execution"). The results of these instructions are stored as speculative results until their final states can be determined.
  • 74.
    - Superscalar processor of degree three: the Pentium Pro has three instruction decoders and can execute 3 instructions simultaneously
    - Internal cache: L2 cache in the same package
  • 75.
    Pentium II
    - The Pentium Pro had a short life due to
      - the lack of MMX instructions
      - use of the expensive dual- and tri-cavity package
    - The Pentium II is a Pentium Pro with MMX technology, repackaged in a new single-edge contact (SEC) cartridge that is inserted in a "Slot 1" connector (242 pins) or a "Slot 2" connector (330 pins)
    - Processor and L2 cache are mounted on a ceramic substrate (the silicon dies are separate)
    - Processor clock: 300 to 450 MHz; bus clock: 100 MHz
    - L1 (32 KB) and L2 (512 KB) caches, with a 64-bit dedicated bus
  • 76.
    Exploded view of single-edge contact (SEC) cartridge. (Courtesy of Intel Corporation.)
  • 77.
    Installing the SEC cartridge into the retention mechanism. (Courtesy of Intel Corporation.)
  • 78.
    Celeron
    - A Pentium II without L2 cache (Pentium II SX?)
    - Uses the Slot 1 connector without the plastic cover: the so-called "naked CPU"
    - Celeron A: includes a 128 KB L2 cache on the same die as the processor
    - Drawback: 66 MHz bus
    - 370-pin PGA package (called Socket 370)
  • 79.
    The Celeron processor is a Pentium II without the L2 cache. Later versions, called the Celeron A, include this cache on the same silicon die with the processor. (Courtesy of Intel Corporation.)
  • 80.
    Pentium III
    - Higher clock speed: based on the Pentium II core, with a 600 MHz clock and an external bus frequency of 133 MHz
    - 70 new streaming SIMD extensions (SSE)
      - 50 to improve floating-point performance
      - 12 to improve multimedia processing
      - 8 to improve the efficiency of the L1 cache
  • 81.
    The Pentium III microprocessor with integrated L2 cache. This chip has more than 22 million transistors. (Courtesy of Intel Corporation.)
  • 82.
    Xeon Processors
    - Scalability: as processing demands increase, additional processors can be interconnected to keep pace
      - one advantage of the Pentium Pro is that it supports up to 4 processors: SMP (symmetric multiprocessing)
    - The Pentium II Xeon processor can be scaled to 2, 4, 8, or more processors and is used for high-end servers and workstations
    - Pentium III Xeon processor: similar, but offers the streaming SIMD technology
  • 83.
    P7 Itanium
    - IA-64: 7th-generation processor architecture; code name Merced
    - 64-bit architecture: 128 64-bit registers and 128 82-bit floating-point registers (including hidden bits); cf. IA-32: 10 32-bit registers, 8 floating-point registers
    - Explicit parallelism: instructions are packed in 128-bit bundles ready for execution. Each bundle consists of three 41-bit instructions and a 5-bit template. All three instructions are dispatched in parallel.
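The bundle arithmetic (3 × 41 + 5 = 128 bits) can be checked with a small packing routine. This is a sketch only: the function names are hypothetical, and the field order used here (template in the lowest bits, then slots 0 to 2) is an assumption made for illustration, since the slide does not give the layout.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer.

    Layout assumed here: template in the lowest 5 bits, then slots 0, 1, 2.
    """
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    )
    return template, slots


bundle = pack_bundle(0b10101, (1, 2, 3))
print(bundle.bit_length() <= 128, unpack_bundle(bundle))
```

Packing and then unpacking returns the original fields, which confirms that the three slots and the template fill the 128 bits exactly with no overlap.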
  • 84.
    - Speculation: preload data to minimize memory delays when the data is needed
    - Predication: when a conditional branch instruction is encountered, Itanium follows both branch paths, then commits only the results of the correct path
    - Data bus: 128 bits
    - Address bus: 64 bits → max. 2^64 bytes of memory
  • 85.
    80x86-Compatible Microprocessors
    - Second sources: companies manufacturing 80x86 microprocessor chips under license from Intel
    - Clones and look-alikes: pin-for-pin replacements with all of the same features as the Intel processor. [Ex] AMD 386DX, 486DX4-100, Cyrix 5x86, etc.
  • 86.
    The AMD K7 or Athlon processor. It mates to a new proprietary socket called Slot A. (Courtesy of Advanced Micro Devices.)
  • 87.
    Measuring Processor Performance
    - Benchmark programs: used to measure the performance of a computer system (system benchmarks) or of a component in that system such as the processor, disk, video card, or main memory (component benchmarks)
    - Component-level benchmarks
      - Whetstone: measures the time to execute integer and floating-point arithmetic instructions and "if" statements; it includes a high percentage of floating-point operations → mostly used to represent numerical programs
  • 88.
    - Dhrystone: a synthetic benchmark consisting of 12 procedures with 94 statements and no floating-point operations
    - Microprocessor benchmarks: developed for comparing the processing ability of the various microprocessor chips, e.g. Ziff-Davis' CPUmark and Intel's iCOMP index
      - CPUmark: measures the speed of a PC's processor subsystem, including the CPU, its internal and external caches, and system RAM. [Ex] Fig. 1-20: CPUmark99 ratings for 80x86s
      - iCOMP: combines 4 industry-standard benchmarks: CPUmark32, Norton SI32, SPEC95, and the Intel Media Benchmark (audio, video, image, 3-D, etc.)
  • 89.
    CPUmark is a benchmark that measures the speed of the processor and its internal cache.
  • 90.
    System-level Benchmarks
    - Microcomputer benchmarks: measure the speed of the processor while taking a slow disk or video subsystem into account
    - Winstone: system-level, application-based benchmark that measures a PC's overall performance when running today's 32-bit applications on Windows 95, 98, and NT. [Ex] Winstone 98 ratings for 80x86s
    - Performance rating: Cyrix and AMD developed the P-rating (Processor Performance) system, which runs applications on a processor and compares the result to a Pentium microprocessor. [Ex] Table 1-2: PR166 to 366 for AMD and Cyrix chips
  • 91.
    Winstone 98 measures the performance of a PC system running typical Windows applications.