UG Program
School of Electrical and Computer Engineering
Computer Architecture and Organization (ECEG-3163)
Chapter 2: Computer Evolution and Performance
Sem. I, 2014/15
Generations of Computer
Vacuum tube - 1946-1957
Transistor - 1958-1964
Small scale integration - 1965 on
Up to 100 devices on a chip
Medium scale integration - to 1971
100-3,000 devices on a chip
Large scale integration - 1971-1977
3,000-100,000 devices on a chip
Very large scale integration - 1978-1991
100,000-100,000,000 devices on a chip
Ultra large scale integration - 1991 on
Over 100,000,000 devices on a chip
ENIAC - background
Electronic Numerical Integrator And Computer
Eckert and Mauchly
University of Pennsylvania
Trajectory tables for weapons
Started 1943
Finished 1946
Too late for war effort
Used until 1955
ENIAC - details
Decimal (not binary)
20 accumulators of 10 digits
Programmed manually by switches
18,000 vacuum tubes
30 tons
15,000 square feet
140 kW power consumption
5,000 additions per second
Demonstrated the feasibility of a general-purpose electronic computer
Von Neumann/Turing
Stored program concept
Main memory storing both programs and data
ALU operating on binary data
Control unit interpreting instructions from memory and executing them
Input and output equipment operated by the control unit
Princeton Institute for Advanced Studies (IAS) computer
Completed 1952
The first publication of the idea was in a 1945 proposal by von Neumann for a new computer, the EDVAC (Electronic Discrete Variable Automatic Computer)
Structure of von Neumann machine
A main memory, which stores both data and instructions
An arithmetic and logic unit (ALU) capable of operating on binary data
A control unit, which interprets the instructions in memory and causes them to be executed
Input/output (I/O) equipment operated by the control unit
Brief Description of the IAS computer
The memory of the IAS consists of 1,000 storage locations, called words, of 40 binary digits (bits) each.
Both data and instructions are stored there.
Numbers are represented in binary form, and each instruction is a binary code.
IAS Memory Format
Each number is represented by a sign bit and a 39-bit value.
A word may also contain two 20-bit instructions, with each instruction consisting of an 8-bit operation code (opcode) and a 12-bit address.
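To make the field boundaries concrete, here is a minimal decoding sketch in Python. The exact layout (left instruction in the high-order 20 bits, opcode above address) and all names are assumptions for illustration, not IAS terminology.

```python
# Sketch: unpack a 40-bit IAS word into its two (opcode, address) pairs.
# Layout assumed: left instruction in the high-order 20 bits, each
# instruction = 8-bit opcode followed by a 12-bit address.

def decode_instruction_word(word):
    left = (word >> 20) & 0xFFFFF    # high-order 20 bits
    right = word & 0xFFFFF           # low-order 20 bits
    def split(instr):
        opcode = (instr >> 12) & 0xFF    # top 8 bits of the instruction
        address = instr & 0xFFF          # bottom 12 bits
        return opcode, address
    return split(left), split(right)

# Example: a word holding two hypothetical instructions
word = (0x01 << 32) | (0x123 << 20) | (0x02 << 12) | 0x456
print(decode_instruction_word(word))   # ((1, 291), (2, 1110))
```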
IAS Main Registers
Memory buffer register (MBR): Contains a word to be stored in memory or sent to the I/O unit, or is used to receive a word from memory or from the I/O unit.
Memory address register (MAR): Specifies the address in memory of the word to be written from or read into the MBR.
Instruction register (IR): Contains the 8-bit opcode of the instruction being executed.
Instruction buffer register (IBR): Employed to hold temporarily the right-hand instruction from a word in memory.
Program counter (PC): Contains the address of the next instruction pair to be fetched from memory.
Accumulator (AC) and multiplier quotient (MQ): Employed to hold temporarily operands and results of ALU operations. For example, the result of multiplying two 40-bit numbers is an 80-bit number; the most significant 40 bits are stored in the AC and the least significant in the MQ.
IAS operation
Operates by repetitively performing an instruction cycle.
Each instruction cycle consists of two subcycles: the fetch cycle and the execute cycle.
In the fetch cycle, the opcode of the next instruction is loaded into the IR and the address portion is loaded into the MAR.
The instruction may be taken from the IBR, or obtained from memory by loading a word into the MBR and then passing it down to the IBR, IR, and MAR.
A single register (the MAR) specifies the address in memory for a read or write, and a single register (the MBR) serves as the source or destination.
Once the opcode is in the IR, the execute cycle is performed: control circuitry interprets the opcode and executes the instruction.
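The fetch logic above can be summarized in a short sketch, assuming the simplified register model described on this slide; the dictionary-based state and the helper name are illustrative, and the real IAS sequencing differs in detail.

```python
# Sketch of the IAS fetch cycle: use the buffered right-hand instruction
# if one is waiting in the IBR, otherwise read the next word from memory.

def fetch(state, memory):
    if state["IBR"] is not None:                 # right-hand instruction waiting
        instr = state["IBR"]
        state["IBR"] = None
    else:
        state["MBR"] = memory[state["PC"]]       # read the next word
        instr = (state["MBR"] >> 20) & 0xFFFFF   # left-hand instruction
        state["IBR"] = state["MBR"] & 0xFFFFF    # buffer the right-hand one
        state["PC"] += 1                         # PC addresses instruction pairs
    state["IR"] = (instr >> 12) & 0xFF           # 8-bit opcode
    state["MAR"] = instr & 0xFFF                 # 12-bit address

state = {"PC": 0, "IBR": None, "MBR": 0, "IR": 0, "MAR": 0}
memory = [(0x01 << 32) | (0x123 << 20) | (0x02 << 12) | 0x456]
fetch(state, memory)
print(hex(state["IR"]), hex(state["MAR"]))   # 0x1 0x123
fetch(state, memory)
print(hex(state["IR"]), hex(state["MAR"]))   # 0x2 0x456
```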
IAS Operation Partial Flow Chart
IAS Instruction Set
The IAS computer had a total of 21 instructions, which can be grouped as follows:
Data transfer: Move data between memory and ALU registers or between two ALU registers.
Unconditional branch: Normally, the control unit executes instructions in sequence from memory. This sequence can be changed by a branch instruction, which facilitates repetitive operations.
Conditional branch: The branch can be made dependent on a condition, thus allowing decision points.
Arithmetic: Operations performed by the ALU.
Address modify: Permits addresses to be computed in the ALU and then inserted into instructions stored in memory. This allows a program considerable addressing flexibility.
IAS Instruction Set Details
Commercial Computers
1947 - Eckert-Mauchly Computer Corporation
UNIVAC I (Universal Automatic Computer)
Used for the US Bureau of the Census 1950 calculations
Became part of Sperry-Rand Corporation
Late 1950s - UNIVAC II
Faster
More memory
IBM
Punched-card processing equipment
1953 - the 701
IBM's first electronic stored-program computer
Scientific calculations
1955 - the 702
Business applications
Led to the 700/7000 series
Transistors
Replaced vacuum tubes
Smaller
Cheaper
Less heat dissipation
Solid-state device
Made from silicon (sand)
Invented 1947 at Bell Labs
William Shockley et al.
Second Generation : Transistors
Transistor-based computers brought:
More complex arithmetic and logic units and control units
Use of high-level programming languages, and system software that provided the ability to load programs (the beginning of operating systems)
Data channels: independent I/O modules with their own processor and instruction set, which relieve the CPU of a considerable processing burden
Multiplexor: the termination point for data channels, the CPU, and memory; it schedules access to memory from the CPU and the data channels
NCR & RCA produced small transistor machines
IBM 7000 series
DEC (founded 1957) produced the PDP-1
Microelectronics
Literally - "small electronics"
A computer is made up of gates, memory cells, and interconnections
These can be manufactured on a semiconductor material, e.g. a silicon wafer
Jack Kilby invented the integrated circuit at Texas Instruments in 1958, creating a transistor, resistor, and capacitor from a piece of germanium
Robert Noyce also independently invented the IC
Microelectronics - Modern IC Manufacturing
Uses the technique of photolithography
Moore’s Law
Gordon Moore - co-founder of Intel
Predicted that the number of transistors on a chip would double every year
Since the 1970s development has slowed a little: the number of transistors doubles every 18 months
Cost of a chip has remained almost unchanged
Consequences of Moore's Law:
Higher packing density means shorter electrical paths, giving higher performance
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increase reliability
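As a rough illustration of the doubling rule, here is a one-line growth model. The 18-month doubling period is taken from the slide; the starting count is the 4004's approximate transistor count (about 2,300 in 1971).

```python
# Sketch of Moore's-law growth: doubling every 18 months (assumed).

def transistor_count(initial, months, doubling_period=18):
    """Project the transistor count after `months` of doubling growth."""
    return initial * 2 ** (months / doubling_period)

# Starting from the 4004's ~2,300 transistors, project ten years ahead:
print(round(transistor_count(2300, 120)))   # ~234,000
```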
Growth in Transistor Count on Integrated Circuits
IBM 360 series
1964
Replaced (& not compatible with) 7000 series
First planned “family” of computers
Similar or identical instruction sets
Similar or identical O/S
Increasing speed
Increasing number of I/O ports (i.e. more terminals)
Increased memory size
Increased cost
DEC PDP-8
1964
First minicomputer (named after the miniskirt!)
Small enough to sit on a lab bench
$16,000, versus $100k+ for an IBM 360
Intel
1971 - 4004
First microprocessor
All CPU components on a single chip
4 bit
Intel …
Followed in 1972 by the 8008
8 bit
Both designed for specific applications
1974 - 8080
Intel's first general purpose microprocessor
IBM (1981) produced the PC using the Intel 8088 microprocessor
8086: a 16-bit machine
80386: Intel's first 32-bit machine
Semiconductor Memory
1970
Fairchild
Size of a single core, i.e. 1 bit of magnetic core storage
Holds 256 bits
Non-destructive read
Much faster than core
Capacity approximately doubles each year
Microprocessor Speed
The raw speed of the microprocessor goes unused unless a constant stream of work is fed to the CPU.
Among the techniques used are the following (a timing sketch follows this list):
Pipelining: With pipelining, a processor can simultaneously work on multiple instructions, i.e. while one instruction is being executed, the processor is decoding the next instruction.
Branch prediction: The processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next. The more sophisticated examples of this strategy predict not just the next branch but multiple branches ahead.
Microprocessor Speed …
Data flow analysis: The processor analyzes which instructions are dependent on each other's results, or data, to create an optimized schedule of instructions. In fact, instructions are scheduled to be executed when ready, independent of the original program order. This prevents unnecessary delay.
Speculative execution: Using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations. This enables the processor to keep its execution engines as busy as possible by executing instructions that are likely to be needed.
These and other sophisticated techniques are made necessary by the sheer power of the processor.
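As promised above, here is a minimal timing sketch of why pipelining helps, using idealized cycle counts (no stalls, hazards, or memory delays); the function names are mine.

```python
# Idealized pipeline timing model: k stages, n instructions.

def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction passes through every stage before the next starts."""
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    """Fill the pipeline once, then complete one instruction per cycle."""
    return n_stages + (n_instructions - 1)

n, k = 100, 5
print(cycles_unpipelined(n, k))   # 500
print(cycles_pipelined(n, k))     # 104 -> close to a k-fold speedup
```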
Speeding it up
By increasing the hardware speed of the microprocessor
Fundamentally due to shrinking logic gate size
More gates, packed more tightly, reduce propagation time for signals, which in turn increases clock speed
Chipmakers fabricate chips of greater density and speed, and processor designers come up with techniques that use the higher speed
Performance Balance
Processor speed increased
Memory capacity increased
Memory speed lags behind processor speed
Logic and Memory Performance Gap
Coping with the Processor-Memory Gap
Increase number of bits retrieved at one time
Make DRAM “wider” rather than “deeper”
Change DRAM interface
Cache
Reduce frequency of memory access
More complex cache and cache on chip
Increase interconnection bandwidth
High speed buses
Hierarchy of buses
Problems with Clock Speed and Logic Density
Power
Power density increases with density of logic and clock speed
Dissipating and controlling the resulting heat is becoming difficult; excessive heat can permanently damage transistors
RC delay
Speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting them
Delay increases as the RC product increases
Wire interconnects become thinner, increasing resistance
Wires are closer together, increasing capacitance
Solution:
More emphasis on other organizational and architectural approaches
Increased Cache Capacity
Typically two or three levels of cache between processor and main memory
Chip density increased
More cache memory on chip
Faster cache access
Pentium chip devoted about 10% of chip area to cache
Pentium 4 devotes about 50%
More Complex Execution Logic
Enable parallel execution of instructions
Pipeline works like an assembly line
Different stages of execution of different instructions proceed at the same time along the pipeline
Superscalar allows multiple pipelines within a single processor
Instructions that do not depend on one another can be executed in parallel
Diminishing Returns
As processors become more complex, further significant increases in parallelism are likely to be relatively modest
Designing new techniques becomes increasingly difficult
Benefits from cache are reaching a limit
New Approach – Multiple Cores
Multiple processors on a single chip
Large shared cache
If software can use multiple processors, doubling the number of processors almost doubles performance
So, use two simpler processors on the chip rather than one more complex processor
With two processors, larger caches are justified
Power consumption of memory logic is less than that of processing logic
Intel Microprocessor Performance
x86 Evolution (1)
8080
first general purpose microprocessor
8 bit data path
Used in first personal computer - Altair
8086 - 5 MHz - 29,000 transistors
much more powerful
16 bit
instruction cache, prefetches a few instructions
8088 (8 bit external data bus) used in first IBM PC
80286
16 MByte memory addressable
up from 1 MByte
80386
32 bit
Support for multitasking
80486
sophisticated powerful cache and instruction pipelining
built-in maths co-processor
x86 Evolution (2)
Pentium
Superscalar
Multiple instructions executed in parallel
Pentium Pro
Increased superscalar organization
Aggressive register renaming
branch prediction
data flow analysis
speculative execution
Pentium II
MMX technology
graphics, video & audio processing
Pentium III
Additional floating point instructions for 3D graphics
x86 Evolution (3)
Pentium 4
Further floating point and multimedia enhancements
Core
First x86 with dual core
Core 2
64 bit architecture
Core 2 Quad – 3GHz – 820 million transistors
Four processors on chip
x86 architecture dominant outside embedded systems
Organization and technology changed dramatically
Instruction set architecture evolved with backwards compatibility
~1 instruction per month added
500 instructions available
Performance Assessment: Clock Speed
Key parameters of computer assessment:
Performance, cost, size, security, reliability, power consumption
System clock speed is given in Hz or multiples thereof; related terms are clock rate, clock cycle, clock tick, and cycle time
Signals in the CPU take time to settle down to 1 or 0, and the worst-case settling time of a signal inside the CPU bounds the clock speed (e.g. a 1 GHz clock has a cycle time of 1 ns)
Instruction execution proceeds in discrete steps:
Fetch, decode, load and store, arithmetic or logical
Usually requires multiple clock cycles per instruction
Pipelining gives simultaneous execution of instructions
So, clock speed is not the whole story in performance assessment
Instruction Execution Rate
A processor is driven by a clock with a constant frequency f or, equivalently, a constant cycle time t, where t = 1/f.
The instruction count Ic is the number of machine instructions executed for a program.
An important parameter is the average cycles per instruction (CPI) for a program.
If all instructions required the same number of clock cycles, then CPI would be a constant value for a processor.
In practice, the number of clock cycles required varies for different types of instructions, such as load, store, branch, and so on.
Let CPIi be the number of cycles required for instruction type i and Ii be the number of executed instructions of type i for a given program. Then the overall CPI is:
CPI = (Σ CPIi × Ii) / Ic
The processor time T needed to execute a given program can be expressed as:
T = Ic × CPI × t
We can refine this formulation by recognizing that during the execution of an instruction, part of the work is done by the processor, and part of the time a word is being transferred to or from memory. In this latter case, the time to transfer depends on the memory cycle time, which may be greater than the processor cycle time. The refined form is:
T = Ic × [p + (m × k)] × t
where p is the number of processor cycles needed to decode and execute the instruction, m is the number of memory references needed, and k is the ratio between the memory cycle time and the processor cycle time.
The five performance factors in the preceding equation (Ic, p, m, k, t) are influenced by four system attributes: the design of the instruction set (known as the instruction set architecture), compiler technology (how effective the compiler is in producing an efficient machine language program from a high-level language program), processor implementation, and cache and memory hierarchy.
A common measure of performance is the MIPS rate (millions of instructions per second):
MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6)
Instruction Execution Rate Example
For example, consider the execution of a program that results in the execution of 2 million instructions on a 400-MHz processor. The program consists of four major types of instructions. The instruction mix and the CPI for each instruction type are given below, based on the result of a program trace experiment:

Instruction type                    CPI   Instruction mix
Arithmetic and logic                 1        60%
Load/store with cache hit            2        18%
Branch                               4        12%
Memory reference with cache miss     8        10%

Calculate the average CPI and the MIPS rate.
The average CPI when the program is executed on a uniprocessor with the above trace results is:
CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24
The corresponding MIPS rate is:
MIPS = f / (CPI × 10^6) = (400 × 10^6) / (2.24 × 10^6) ≈ 178
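The same calculation as a small script, with the values taken from the example above:

```python
# Sketch reproducing the CPI/MIPS calculation from the worked example.

def average_cpi(mix):
    """mix: list of (cycles_per_instruction, fraction_of_instructions)."""
    return sum(cpi * frac for cpi, frac in mix)

def mips_rate(freq_hz, cpi):
    """MIPS = f / (CPI * 10**6)."""
    return freq_hz / (cpi * 1e6)

mix = [(1, 0.60),   # arithmetic and logic
       (2, 0.18),   # load/store with cache hit
       (4, 0.12),   # branch
       (8, 0.10)]   # memory reference with cache miss

cpi = average_cpi(mix)
print(round(cpi, 2))                     # 2.24
print(round(mips_rate(400e6, cpi), 1))   # 178.6
```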
Benchmarks
Programs designed to test performance
Written in a high level language
Portable
Represent a style of task
Systems, numerical, commercial
Easily measured
Widely distributed
E.g. Standard Performance Evaluation Corporation (SPEC)
CPU2006 for computation-bound applications
17 floating point programs in C, C++, Fortran
12 integer programs in C, C++
3 million lines of code
Speed and rate metrics
Single task and throughput
SPEC Speed Metric
Single task
Base runtime defined for each benchmark using a reference machine
Results are reported as the ratio of reference time to system run time:
ri = Trefi / Tsuti
where Trefi is the execution time of benchmark i on the reference machine and Tsuti is the execution time of benchmark i on the system under test
Overall performance is calculated by averaging the ratios for all 12 integer benchmarks, using the geometric mean, which is appropriate for normalized numbers such as ratios:
rG = (r1 × r2 × ... × r12)^(1/12)
Example: the Sun system executes a benchmark program in 934 seconds, while the reference implementation requires 22,135 seconds. The ratio is calculated as 22,135 / 934 ≈ 23.7.
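A minimal sketch of the speed-metric arithmetic; the 22,135/934 pair is from the example, while the other ratios are made up to show the geometric mean.

```python
# SPEC speed metric: per-benchmark ratios combined with a geometric mean.
import math

def spec_ratio(t_ref, t_sut):
    """Ratio of reference time to system-under-test time."""
    return t_ref / t_sut

def geometric_mean(ratios):
    return math.prod(ratios) ** (1 / len(ratios))

print(round(spec_ratio(22135, 934), 1))              # 23.7
print(round(geometric_mean([23.7, 18.2, 31.5]), 1))  # 23.9 (illustrative)
```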
SPEC Rate Metric
Measures the throughput, or rate, of a machine carrying out a number of tasks
Multiple copies of the benchmarks are run simultaneously
Typically, the same as the number of processors
The ratio is calculated as follows:
ratei = (N × Trefi) / Tsuti
where Trefi is the reference execution time for benchmark i, N is the number of copies run simultaneously, and Tsuti is the elapsed time from the start of execution of the program on all N processors until the completion of all copies of the program
Again, a geometric mean is calculated
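A corresponding sketch for the rate metric; the copy count and elapsed time here are hypothetical.

```python
# SPEC rate (throughput) metric: N copies of the benchmark run together.

def spec_rate(n_copies, t_ref, t_elapsed):
    """rate_i = N * Tref_i / Tsut_i."""
    return n_copies * t_ref / t_elapsed

# e.g. 4 copies with a hypothetical elapsed time of 1,200 s:
print(round(spec_rate(4, 22135, 1200), 1))   # 73.8
```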
Amdahl’s Law
Gene Amdahl [AMDA67]
Speedup in one aspect of the technology or design does not result in a corresponding improvement in overall performance
Considers the potential speedup of a program using multiple processors
Concluded that:
Code needs to be parallelizable
Speedup is bounded, giving diminishing returns for more processors
Task dependent:
Servers gain by maintaining multiple connections on multiple processors
Databases can be split into parallel tasks
Amdahl’s Law Formula
For a program running on a single processor:
Fraction f of the code is infinitely parallelizable with no scheduling overhead
Fraction (1 - f) of the code is inherently serial
T is the total execution time of the program on a single processor
N is the number of processors that fully exploit the parallel portions of the code
The speedup is then:
Speedup = T / (T(1 - f) + T × f / N) = 1 / ((1 - f) + f / N)
Conclusions:
If f is small, parallel processors have little effect
As N → ∞, the speedup is bounded by 1/(1 - f)
Diminishing returns for using more processors
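A minimal sketch of the formula, showing the 1/(1 - f) bound numerically; the 90% parallel fraction is an arbitrary example.

```python
# Amdahl's law: speedup(N) = 1 / ((1 - f) + f / N).

def amdahl_speedup(f, n):
    """f: parallelizable fraction of the code; n: number of processors."""
    return 1.0 / ((1.0 - f) + f / n)

# With 90% of the code parallelizable, speedup approaches 1/(1 - 0.9) = 10:
for n in (2, 8, 64, 10**6):
    print(n, round(amdahl_speedup(0.9, n), 2))   # 1.82, 4.71, 8.77, 10.0
```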