Assembly Language x86 Family Architecture Motaz K. Saad Spring 2007 Motaz K. Saad, Dept. of CS
Overview
General Concepts
IA-32 Processor Architecture
IA-32 Memory Management
Components of an IA-32 Microcomputer
Input-Output System
Motaz K. Saad, Dept. of CS
General Concepts
Basic microcomputer design
Instruction execution cycle
Reading from memory
How programs run
Motaz K. Saad, Dept. of CS
Basic Microcomputer Design
Clock synchronizes CPU operations
Control unit (CU) coordinates sequence of execution steps
ALU performs arithmetic and bitwise processing
Motaz K. Saad, Dept. of CS
Motaz K. Saad, Dept. of CS Processor Control Unit Arithmetic Logic Unit (ALU) Arithmetic Logic Unit (ALU) Input Devices Storage Devices Output Devices Memory Data Information Instructions Data Information Instructions Data Information Control Unit
Motaz K. Saad, Dept. of CS
Clock
Synchronizes all CPU and BUS operations
Machine (clock) cycle measures time of a single operation
Clock is used to trigger events
Motaz K. Saad, Dept. of CS
What's Next
General Concepts
IA-32 Processor Architecture
IA-32 Memory Management
Components of an IA-32 Microcomputer
Input-Output System
Motaz K. Saad, Dept. of CS
Instruction Execution Cycle
Fetch
Decode
Fetch operands
Execute
Store output
Motaz K. Saad, Dept. of CS
Cache Memory
High-speed expensive static RAM both inside and outside the CPU.
Level-1 cache: inside the CPU
Level-2 cache: outside the CPU
Cache hit: when data to be read is already in cache memory
Cache miss: when data to be read is not in cache memory.
Motaz K. Saad, Dept. of CS
How a Program Runs Motaz K. Saad, Dept. of CS
Multitasking
OS can run multiple programs at the same time.
Multiple threads of execution within the same program.
Scheduler utility assigns a given amount of CPU time to each running program.
Rapid switching of tasks
gives illusion that all programs are running at once
the processor must support task switching.
Motaz K. Saad, Dept. of CS
IA-32 Processor Architecture
Modes of operation
Basic execution environment
Floating-point unit
Intel Microprocessor history
Motaz K. Saad, Dept. of CS
Modes of Operation
Protected mode
native mode (Windows, Linux)
Real-address mode
native MS-DOS
System management mode
power management, system security, diagnostics
Virtual-8086 mode
hybrid of Protected
each program has its own 8086 computer
Motaz K. Saad, Dept. of CS
Basic Execution Environment
Addressable memory
General-purpose registers
Index and base registers
Specialized register uses
Status flags
Floating-point, MMX, XMM registers
Motaz K. Saad, Dept. of CS
Addressable Memory
Protected mode
4 GB
32-bit address
Real-address and Virtual-8086 modes
1 MB space
20-bit address
Motaz K. Saad, Dept. of CS
X86 General-Purpose Registers Named storage locations inside the CPU, optimized for speed. Motaz K. Saad, Dept. of CS
Accessing Parts of Registers
Use 8-bit name, 16-bit name, or 32-bit name
Applies to EAX, EBX, ECX, and EDX
Motaz K. Saad, Dept. of CS
Index and Base Registers
Some registers have only a 16-bit name for their lower half:
superscalar, 32-bit address bus, 64-bit internal data path
Motaz K. Saad, Dept. of CS
Intel P6 Family
Pentium Pro
advanced optimization techniques in microcode
Pentium II
MMX (multimedia) instruction set
Pentium III
SIMD (streaming extensions) instructions
Pentium 4 and Xeon
Intel NetBurst micro-architecture, tuned for multimedia
Motaz K. Saad, Dept. of CS
CISC and RISC
CISC – complex instruction set
large instruction set
high-level operations
requires microcode interpreter
examples: Intel 80x86 family
RISC – reduced instruction set
simple, atomic instructions
small instruction set
directly executed by hardware
examples:
ARM (Advanced RISC Machines)
DEC Alpha (now Compaq)
Motaz K. Saad, Dept. of CS
What's Next
General Concepts
IA-32 Processor Architecture
IA-32 Memory Management
Components of an IA-32 Microcomputer
Input-Output System
Motaz K. Saad, Dept. of CS
IA-32 Memory Management
Real-address mode
Calculating linear addresses
Protected mode
Multi-segment model
Paging
Motaz K. Saad, Dept. of CS
Real-Address mode
1 MB RAM maximum addressable
Application programs can access any area of memory
Single tasking
Supported by MS-DOS operating system
Motaz K. Saad, Dept. of CS
Segmented Memory
Segmented memory addressing: absolute (linear) address is a combination of a 16-bit segment value added to a 16-bit offset
linear addresses one segment Motaz K. Saad, Dept. of CS
Calculating Linear Addresses
Given a segment address, multiply it by 16 (add a hexadecimal zero), and add it to the offset
Example: convert 08F1:0100 to a linear address
Adjusted Segment value: 0 8 F 1 0 Add the offset: 0 1 0 0 Linear address: 0 9 0 1 0 Motaz K. Saad, Dept. of CS
Your turn . . . What linear address corresponds to the segment/offset address 028F:0030? 028F0 + 0030 = 02920 Always use hexadecimal notation for addresses. Motaz K. Saad, Dept. of CS
Your turn . . . What segment addresses correspond to the linear address 28F30h? Many different segment-offset addresses can produce the linear address 28F30h. For example: 28F0:0030, 28F3:0000, 28B0:0430, . . . Motaz K. Saad, Dept. of CS
Protected Mode (1 of 2)
4 GB addressable RAM
(00000000 to FFFFFFFFh)
Each program assigned a memory partition which is protected from other programs
easy to do; abstracted from hardware; details hidden
slowest performance
Level 2: Call an operating system function
specific to one OS; device-independent
medium performance
Level 1: Call a BIOS (basic input-output system) function
may produce different results on different systems
knowledge of hardware required
usually good performance
Level 0: Communicate directly with the hardware
May not be allowed by some operating systems
Motaz K. Saad, Dept. of CS
Displaying a String of Characters
When a HLL program displays a string of characters, the following steps take place:
Motaz K. Saad, Dept. of CS
ASM Programming levels ASM programs can perform input-output at each of the following levels: Motaz K. Saad, Dept. of CS
Summary
Central Processing Unit (CPU)
Arithmetic Logic Unit (ALU)
Instruction execution cycle
Multitasking
Floating Point Unit (FPU)
Complex Instruction Set
Real mode and Protected mode
Motherboard components
Memory types
Input/Output and access levels
Motaz K. Saad, Dept. of CS
More Details about X86 Family Architecture X86 family Generations Motaz K. Saad, Dept. of CS
X86 Family
8086 and 8088 Microprocessors
80x86 architecture
Address bus : 20 bits, 16 bits for 8-bit chips Max. memory capacity : 1 Mbytes
Internal structure is divided into BIU and EU Fetch and instruction execution can occur simultaneously
Length of internal registers expanded from 8 bit to 16 bit
H/W multiply and divide instructions built into the processor
Support for an external math coprocessor for floating-point operations in H/W as much as 100 times faster
Motaz K. Saad, Dept. of CS
Intel 8085 architecture : 8-bit data, 16-bit address Motaz K. Saad, Dept. of CS
Internal architecture of 8086 Motaz K. Saad, Dept. of CS
PC Standard
For 16bit data bus, two 8-bit memory banks are required expensive at the time
in 1979, Intel announced 8088 µ -P that is identical to the 8086 except an external 8-bit data bus. Two memory accesses are needed to input a word.
IBM announced the IBM-PC, using 8088 µ -P and 16 KB memory (expandable to 64 KB). Clock speed : 4.77 MHz -------- PC standard is defined.
Motaz K. Saad, Dept. of CS
80186 and 80188 Microprocessors
High-integration CPUs : includes 8086 (or 8088) core and a clock generator, a programmable timer, an interrupt controller, a DMA controller, etc.
Instruction set is fully compatible to 8086 and 8088, but include 9 new instructions.
Used for IBM-PC compatibles and many embedded computers.
Motaz K. Saad, Dept. of CS
80286 Microprocessor
Processor of IBM PC-AT
Provide two programming modes
Real mode - functions exactly same as 8086 - use only 20 least significant address lines (max. 1 MB) - faster than 8086 due to redesigning and higher clock
Protected mode - 16 new instructions are added - support multi-program environment by giving each program a predetermined amount of memory (16 MB) - programs no longer have physical addresses, but are addressed by a segment selector - Several programs can be loaded into memory at the same time, but protected from each other (*MS-DOS)
Motaz K. Saad, Dept. of CS
The 8086 and 80286 microprocessors. Motaz K. Saad, Dept. of CS
80386 Microprocessor
New Standard announced (1985) by Intel with commitment of successive u-P generations being remained compatible with this chip, Intel Architecture-32 ( IA-32 ) thru 2000.
Data bus & internal registers : 32 bits
Address bus : 32 bits max. 4 GB of physical memory
Motaz K. Saad, Dept. of CS
Internal architecture of 80386 Motaz K. Saad, Dept. of CS
Internal registers (partly) of 80386 Motaz K. Saad, Dept. of CS
80386 supports two operating modes (like 80286) 1) Real Address Mode - used by MS-DOS - in this mode, 80386 becomes a fast 8086. 2) Protected Virtual Address Mode (Protected Mode) - On-board MMU manages 4 GB of memory - Each task is given a segment of memory governed by a descriptor register , that defines the segment base address, the segment limit, and the attributes for the segment (code, data, read-only, etc.) - Use paging technique : 4 KB pages can be swapped in and out of memory (using a disk) to allow a task to have a virtual memory space as large as 64 TB.
Motaz K. Saad, Dept. of CS
When operating with 64 KB of cache, the 386 achieves a hit rate of 93% the processor operates at full speed 93% of the time
Instruction set of 386 is 100% compatible with the older processors in the family.
14 new instructions are added and several others are modified. [ex] data can be moved between the internal registers at a time.
80386SX : designed to ease the transition from 16- to 32-bit processors --- 16-bit external data bus and 24-bit address bus .
Motaz K. Saad, Dept. of CS
80486 Microprocessor
Maintain compatibility with the older u-Ps
Only 6 new instructions are added to be used by OS S/W, not by application programs.
Redesigned using RISC concepts frequently used instructions to execute in a single clock cycle.
New 5-stage instruction execution pipeline 5 instructions can be executed at once.
On-board 8K cache and 80387 coprocessor twice faster than 386 (20 MHz 387 = 40 MHz 386)
486SX : excludes 80387, designed for low-end appli- cations that do not require a coprocessor.
Motaz K. Saad, Dept. of CS
486DX2 and DX4
DX2 : the internal clock rate is twice the external clock.
DX4 : the internal clock rate is three times . Allow to use less expensive components on the computer system board, while the processor operate at its maximum data rate (internally). [Ex] 486DX2 66 : 66 MHz (int. clock) & 33 MHz (ext. clock) 486DX4 100 : 100 MHz (int. clock) & 33 MHz (ext. clock)
Overdrive Processors : 486 system boards include an over- drive socket to allow users to upgrade low-speed 486DX or 486SX with 486DX2 and DX4 style processors.
Motaz K. Saad, Dept. of CS
Pentium
Superscalar Architecture : provides two instruction execution pipelines, each with its own ALU, address generation circuitry, and data cache interface. execute two different instructions simultaneously
Additional Features :
includes on-board cache (separate 8K instruction cache and data cache) and a coprocessor
8-stage instruction pipelines
achieves 5~8 times floating-point performance of 486
external data bus : 64 bits
about twice as fast as the 486
Motaz K. Saad, Dept. of CS
Key features of the Pentium microprocessor. The execution unit has two pipelines allowing two instructions to be executed simultaneously. Motaz K. Saad, Dept. of CS
57 instructions are added for multimedia (audio, video, and graphic data) applications.
SIMD(Single-Instruction stream Multiple-Data stream) allows the same operation to be performed on multiple data items. Because many multimedia applications require large blocks of data to be manipulated, SIMD provides a significant performance enhancement.
Internal cache size is increased from 16K to 32K.
For general applications, 10~20% performance improved.
For multimedia applications, nearly 70% improved.
Motaz K. Saad, Dept. of CS
Socket 7 : ZIF(zero insertion force) socket
Pentium chip : 296-pin PGA package. A heat sink and fan are mounted atop the chip, and the entire assembly plugged into a ZIF, so-called socket 7.
Socket 7 defines a platform that defines the front side bus connection to the L2 cache, disk interface, video interface, and the ISA and PCI expansion buses.
Motaz K. Saad, Dept. of CS
Pentium processor with heat sink and fan mated to a Socket 7 connector. Motaz K. Saad, Dept. of CS
Pentium Pro
6 th - generation processors ( Pentium Pro, Pentium II, Pentium III and Celeron )
36 address lines max. 64 GB memory
New features 1. Inclusion of L2 cache in the same package with proc. 2. New system board platform called Socket 8 (Pro), slot 1 & 2 (Pentium II, III, and Celeron), and Socket 370 (Pentium III and Celeron). 3. New instruction architecture based on Dynamic Execution
Two chips in One Package : Pentium Pro consists of two separate silicon dies – one for the processor and the other for 256KB L2 cache.
Motaz K. Saad, Dept. of CS
The Pentium Pro is two chips in one. The larger die is the processor, the smaller a 256K L2 cache. (Courtesy of Intel Corporation.) Motaz K. Saad, Dept. of CS
Dynamic Execution : a new approach to processing S/W instructions that reduces idle processor time.
Multiple Branch Prediction : Pentium Pro can look as far as 30 instructions ahead to anticipate conditional branches reduce waste of pipeline clocks.
Data Flow Analysis : looks at upcoming S/W instruc- tions for the optimal sequence of processing.
Speculative Execution : allows to execute instructions in a different order from which they are entered the processor = “ out-of-order execution ” . The result of these instructions are stored as speculative results until their final states can be determined.
Motaz K. Saad, Dept. of CS
Superscalar Processor of Degree Three : Pentium has three instruction decoders, and can execute 3 simul- taneous instructions. Internal Cache : L2 cache in the same package. Motaz K. Saad, Dept. of CS
Pentium II
Pentium Pro is dead (short life) due to
the lack of MMX instructions
use of the expensive dual- and tri-cavity package
Pentium II is a Pentium Pro with MMX technology, repackaged in a new single-edge contact(SEC) cartridge that is inserted in “ Slot 1 connector – 242 pins ” or “ Slot 2 connector – 330 pins
Processor and L2 are mounted on a ceramic substrate (silicon dies are separate)
Exploded view of single-edge contact (SEC) cartridge. (Courtesy of Intel Corporation.) Motaz K. Saad, Dept. of CS
Installing the SEC cartridge into the retention mechanism. (Courtesy of Intel Corporation.) Motaz K. Saad, Dept. of CS
Celeron
Pentium II without L2 cache (Pentium II SX ?)
Use the slot 1 connector without the plastic cover called “ naked CPU ”
Celeron A : Include 128KB L2 cache on the same die with processor.
Drawback : 66 MHz bus cycle
370-pin PGA package (called Socket 370)
Motaz K. Saad, Dept. of CS
The Celeron processor is a Pentium II without the L2 cache. Later versions, called the Celeron A, include this cache on the same silicon die with the processor. (Courtesy of Intel Corporation.) Motaz K. Saad, Dept. of CS
Pentium III
H igher clock speed : b ased on the Pentium II core, with 600MHz clock and an external bus freq. of 133MHz
70 new streaming SIMD extensions (SSE) :
50 to improve floating-point performance
12 to improve multimedia processing
8 to improve the efficiency of L1 cache
Motaz K. Saad, Dept. of CS
The Pentium III microprocessor with integrated L2 cache. This chip has more than 22 million transistors. (Courtesy of Intel Corporation.) Motaz K. Saad, Dept. of CS
Xeon Processors
Scalability : As processing demands increase, additional processors can be interconnected to keep pace. - One of the advantages of Pentium Pro that can support up to 4 processors ; SMP (symmetric multiprocessing)
Pentium II Xeon processor can be scaled to 2, 4, 8 or more, and used for high-end server and workstations.
Pentium III Xeon processor : similar but offer the strea- ming SIMD technology.
Explicit parallelism : instructions are packed in 128-bit bundles ready for execution. Each bundle consists of 3 41-bit instructions and 5-bit template. All three inst- ructions are dispatched in parallel
Motaz K. Saad, Dept. of CS
Speculation : preload data to minimize memory delays when data is needed
Predication : When a conditional branch instruction is encountered, Itanium follows both branch paths, then commits the results of the correct path only.
Data bus : 128 bits
Address bus : 64 bits max. 2 64 bytes memory
Motaz K. Saad, Dept. of CS
80x86 Compatible Microprocessors
Second Sources : manufacturing 80x86 u-P chips after licensed by Intel.
Clones and Look-Alikes
Pin-for-pin replacements with all of the same fea- tures as the Intel processor. [Ex] AMD 386DX, 486DX4-100, Cyrix 5x86, etc.
Motaz K. Saad, Dept. of CS
The AMD K7 or Athlon processor. It mates to a new proprietary socket called Slot A. (Courtesy of Advanced Micro Devices.) Motaz K. Saad, Dept. of CS
Measuring Processor Performance
Benchmark programs : used to measure the performance of a computer system (system benchmarks) or of a com- ponent in that system such as the processor, disk, video card, or main memory (component benchmarks).
Component-level Benchmarks
Whetstone : used to measure the time to execute integer and floating-point arithmetic instructions and “ if ” statements. --- including a high percentage of fl.pt. operations mostly used to represent numerical programs.
Motaz K. Saad, Dept. of CS
Dhrystone : a synthetic benchmark consisting of 12 procedures with 94 statements, no fl.-pt. ops.
Microprocessor Benchmarks : developed for compa- ring the processing ability of the vaious u-P chips. --- Ziff-Davis ’ CPUmark and Intel ’ s iCOMP index.
CPUmark : measures the speed of a PC ’ s proc- essor subsystem, including the CPU, its internal and external caches, and system RAM. [Ex] Fig. 1-20 : CPUmark99 ratings for 80x86s
iCOMP : combines 4 industry standard benchmarks : CPUmark32, Norton SI32, SPEC95, and the Intel Media Benchmark (audio, vedio, image, 3-D, etc.).
Motaz K. Saad, Dept. of CS
CPUmark is a benchmark that measures the speed of the processor and its internal cache. Motaz K. Saad, Dept. of CS
System-level Benchmarks
Microcomputer Benchmarks : measures the speed of processor with considering a slow disk or video subsystem.
Winston : System-level, application-based benchmark to measure a PC ’ s overall performance when running today ’ s 32-bit applications on Window 95, 98, NT. [Ex] Winstone 98 ratings for 80x86s
Performance Rating : Cyrix and AMD developed the P-rating (Processor Performance) system --- running applications on a processor and compare to a Pentium u-P.
[Ex] Table 1-2 : PR166 ~ 366 for AMD and Cyrix chips
Motaz K. Saad, Dept. of CS
Winstone 98 measures the performance of a PC system running typical Windows applications. Motaz K. Saad, Dept. of CS
1 comments
Comments 1 - 1 of 1 previous next Post a comment