3. Improving Performance of Computers
• Increasing clock speed
– Physical limitation (Need new hardware)
• Parallelism (Doing more things at once)
– Instruction-level parallelism
• Getting more instructions per second
– Processor-level parallelism
• Having multiple CPUs working on the same problem
Budditha Hettige 3
4. Instruction-level parallelism
• Pipelining
– Instruction execution speed is limited by the time taken
to fetch instructions from memory
– Early computers fetched instructions in advance and
stored them in registers (a prefetch buffer)
• Prefetching divides instruction execution into two parts
– Fetching
– Actual execution
– Pipelining divides instruction execution into many parts; each
part is handled by dedicated hardware, and all parts can run in parallel
5. Pipelining example
• Packaging cakes
– W1: Place an empty box on the belt every 10 seconds
– W2: Place the cake in the empty box
– W3: Close and seal the box
– W4: Label the box
– W5: Remove the box and place it in the large container
6. Computer Pipelines
• S1: Fetch instruction from memory and place it in a buffer
until it is needed
• S2: Decode the instruction; determine its type and the operands it
needs
• S3: Locate and fetch the operands from memory (or registers)
• S4: Execute the instruction
• S5: Write the result back to a register
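The way instructions overlap in the five stages S1 to S5 can be sketched with a small simulation. This is only an illustration, assuming an ideal pipeline with no stalls in which every stage takes exactly one clock cycle; the function name `pipeline_schedule` is hypothetical.

```python
def pipeline_schedule(num_instructions, num_stages=5):
    """Return, per clock cycle, which instruction occupies each stage.

    Assumption: ideal pipeline, no stalls, one cycle per stage.
    Entry [c][s] is the 1-based instruction number sitting in stage
    S(s+1) during cycle c+1, or None if that stage is idle.
    """
    total_cycles = num_instructions + num_stages - 1
    schedule = []
    for cycle in range(total_cycles):
        row = []
        for stage in range(num_stages):
            instr = cycle - stage  # index of the instruction that has reached this stage
            row.append(instr + 1 if 0 <= instr < num_instructions else None)
        schedule.append(row)
    return schedule


# Three instructions flowing through stages S1..S5.
schedule = pipeline_schedule(num_instructions=3)
for cycle, row in enumerate(schedule, start=1):
    cells = ["I%d" % i if i is not None else "--" for i in row]
    print("cycle %d: %s" % (cycle, " ".join(cells)))
```

Each row is one clock cycle and each column one stage, so a new instruction enters S1 every cycle while earlier ones move on to later stages.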
7. Example
T - Cycle time (in ns)
N - Number of stages in the pipeline
Latency:
Time taken to execute an instruction = N x T
Processor Bandwidth:
No. of MIPS the CPU has = 1000 / T MIPS
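The latency and bandwidth relations can be checked numerically. The sketch below assumes T is given in nanoseconds (which is what makes the constant 1000 come out in MIPS) and that one instruction completes per cycle once the pipeline is full. The values N = 5 and T = 2 ns are illustrative only, and the function names are hypothetical.

```python
def pipeline_latency_ns(num_stages, cycle_time_ns):
    """Latency of one instruction: it must pass through all N stages."""
    return num_stages * cycle_time_ns


def bandwidth_mips(cycle_time_ns):
    """Processor bandwidth in MIPS, assuming one instruction completes per cycle.

    One instruction every T ns is 1e9 / T instructions per second,
    which is 1000 / T million instructions per second.
    """
    return 1000 / cycle_time_ns


# Illustrative values (not from the slides): 5 stages, 2 ns cycle time.
latency = pipeline_latency_ns(num_stages=5, cycle_time_ns=2)
mips = bandwidth_mips(cycle_time_ns=2)
print("latency = %d ns, bandwidth = %.0f MIPS" % (latency, mips))
```

Note that adding stages raises latency but does not change the bandwidth, which depends only on the cycle time.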
9. Dual pipelines
• The instruction fetch unit fetches a pair of instructions and puts
each one into its own pipeline
• The Pentium has two five-stage pipelines
– U pipeline (main) executes an arbitrary Pentium instruction
– V pipeline (second) executes only simple integer instructions and one simple
floating-point instruction
• If the instructions in a pair conflict, only the instruction in the U pipeline is
executed. The other instruction is held and paired with the next
instruction
13. Moore’s law
• Describes a long-term trend in the history of
computing hardware
• Formulated by Dr. Gordon Moore in the
1960s.
• Predicts an exponential increase in component
density over time, with a doubling time of 18
months.
• Applicable to microprocessors, DRAMs,
DSPs and other microelectronics.
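The exponential growth stated above can be written as growth = 2^(t / doubling time). The following sketch simply evaluates that expression, assuming the 18-month doubling time quoted on this slide; the function name is hypothetical.

```python
def density_growth_factor(months, doubling_time_months=18):
    """Factor by which component density grows after `months`,
    assuming it doubles every `doubling_time_months` (18 per the slide)."""
    return 2 ** (months / doubling_time_months)


growth_3_years = density_growth_factor(36)    # two doublings
growth_15_years = density_growth_factor(180)  # ten doublings
print(growth_3_years, growth_15_years)
```

Under this assumption, density grows about fourfold in 3 years and about a thousandfold in 15 years.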
15. Moore's Law and Performance
• The performance of computers is determined
by architecture and clock speed.
• Clock speed doubles over a 3-year period due
to on-chip scaling laws.
• Processors using identical or similar
architectures gain performance directly as a
function of Moore's Law.
• Improvements in internal architecture can
yield better gains than predicted by Moore's
Law.
17. Measuring Performance
• Execution time:
– Time between the start and completion of a task
(including disk accesses and memory accesses)
• Throughput:
– Total amount of work done in a given time
18. Performance of a Computer
Two computers, X and Y:
Performance (X) > Performance (Y)
if and only if
Execution Time (Y) > Execution Time (X)
19. Performance Difference of Two Computers
X is n times faster than Y:
n = Execution Time (Y) / Execution Time (X) = Performance (X) / Performance (Y)
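"X is n times faster than Y" is conventionally read as n = execution time of Y divided by execution time of X, with performance taken as the reciprocal of execution time. A minimal sketch under that reading follows; the timings are made-up illustrative values and the function name is hypothetical.

```python
def times_faster(exec_time_x, exec_time_y):
    """Return n such that X is n times faster than Y.

    Assumes performance = 1 / execution time, so
    n = performance(X) / performance(Y) = exec_time_y / exec_time_x.
    """
    return exec_time_y / exec_time_x


# Hypothetical measurements: X takes 10 s, Y takes 15 s on the same task.
n = times_faster(exec_time_x=10.0, exec_time_y=15.0)
print("X is %.1f times faster than Y" % n)
```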
20. CPU Time
• Time CPU spends on a task
• User CPU time
– CPU time spent in the program
• System CPU time
– CPU time spent in OS performing tasks on behalf
of the program
21. CPU Time (Example)
• User CPU time = 90.7 s
• System CPU time = 12.9 s
• Execution time = 2 min 39 s = 159 s
• % of CPU time = (User CPU time + System CPU time) / Execution time x 100%
22. CPU Time
% CPU time = (90.7 + 12.9) / 159 x 100%
= 65%
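The same calculation can be reproduced in a few lines. The numbers are the ones from the example; the function name is hypothetical.

```python
def cpu_time_percent(user_s, system_s, elapsed_s):
    """Share of the elapsed (wall-clock) time the CPU spent on the task, in percent."""
    return (user_s + system_s) / elapsed_s * 100


percent = cpu_time_percent(user_s=90.7, system_s=12.9, elapsed_s=159.0)
print("CPU time = %.0f%% of execution time" % percent)
```

The remaining 35% or so of the elapsed time is spent waiting, for example on I/O or while other programs run.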
23. Clock Rate
• The computer clock runs at a constant rate and
determines when events take place in the
hardware
Clock Rate = 1 / Clock Cycle Time
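Clock rate and clock cycle time are reciprocals of each other, which the following sketch makes explicit. The 400 MHz figure is used purely as an illustration, and the function names are hypothetical.

```python
def clock_cycle_time_s(clock_rate_hz):
    """Duration of one clock cycle in seconds."""
    return 1.0 / clock_rate_hz


def clock_rate_hz(cycle_time_s):
    """Clock rate in hertz for a given cycle time in seconds."""
    return 1.0 / cycle_time_s


cycle_time = clock_cycle_time_s(400e6)  # illustrative 400 MHz clock
print("cycle time = %.1f ns" % (cycle_time * 1e9))
```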
24. Amdahl’s law
• The performance improvement that can be gained
from some faster mode of execution is limited
by the fraction of the time the faster mode can be
used
25. Amdahl’s law
• Speedup depends on
– Fraction of the computation time in the original machine
that can be converted to take advantage of the
enhancement
(Fraction Enhanced)
– Improvement gained by the enhanced execution mode
(Speedup Enhanced)
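These two quantities combine in the standard form of Amdahl's law: overall speedup = 1 / ((1 - Fraction Enhanced) + Fraction Enhanced / Speedup Enhanced). A minimal sketch follows; the function name is hypothetical and the input values are chosen only for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup according to Amdahl's law.

    fraction_enhanced: share of the original execution time that can use
                       the enhancement (between 0 and 1).
    speedup_enhanced:  how much faster that share runs when enhanced.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    unenhanced_part = 1.0 - fraction_enhanced
    enhanced_part = fraction_enhanced / speedup_enhanced
    return 1.0 / (unenhanced_part + enhanced_part)


# Illustrative values: 60% of the time can be enhanced, and runs 3 times faster.
overall = amdahl_speedup(fraction_enhanced=0.6, speedup_enhanced=3.0)
print("overall speedup = %.2f" % overall)
```

Even though the enhanced portion runs 3 times faster, the overall speedup is well below 3 because 40% of the time is unaffected.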
26. Example
Total execution time of a program = 50 s
Execution time that can be enhanced = 30 s
Fraction Enhanced = 30 / 50
= 0.6
28. Example
Normal mode execution time for some portion of
a program = 6 s
Enhanced mode execution time for the same
portion = 2 s
Speedup Enhanced = 6 / 2
= 3
30. Example
• Suppose we consider an enhancement to the processor of a
server system used for Web serving. The new CPU is 10 times
faster on computation in the Web application than the original CPU.
Assume the original CPU is busy with computation 40% of the
time and is waiting for I/O 60% of the time.
What is the overall speedup gained from the
enhancement?
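One way to work this example is to apply Amdahl's law with Fraction Enhanced = 0.4 (the computation share) and Speedup Enhanced = 10. The sketch below assumes the I/O wait time is unaffected by the new CPU; the function name is hypothetical.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S), per Amdahl's law."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# From the example: 40% of the time is computation, which becomes 10x faster.
# Assumption: the 60% spent waiting for I/O does not change.
web_speedup = amdahl_speedup(fraction_enhanced=0.4, speedup_enhanced=10.0)
print("overall speedup = %.2f" % web_speedup)
```

Under these assumptions the overall speedup is 1 / (0.6 + 0.04), roughly 1.56, far less than the tenfold gain on computation alone.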
32. Remark
• If an enhancement is only usable for a fraction
of a task, we cannot speed up the task by more than
1 / (1 - Fraction Enhanced)
33. Example
• A common transformation required in graphics
engines is square root. Implementations of floating-point
(FP) square root vary significantly in
performance, especially among processors designed
for graphics
• Suppose FP square root (FPSQR) is responsible for
20% of the execution time of a critical graphics program
• Design alternatives
1. Enhance the FPSQR hardware and speed up this operation by
a factor of 10
2. Make all FP instructions run faster by a factor of 1.6
34. Example
• FP instructions are responsible for a total of
50% of the execution time. The design team believes
they can make all FP instructions run 1.6 times
faster with the same effort as required for the fast
square root.
Compare these two design alternatives
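The two alternatives can be compared by applying Amdahl's law to each. The sketch below uses the fractions and factors stated in the example (20% and 10x for FPSQR, 50% and 1.6x for all FP instructions); the function name is hypothetical.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S), per Amdahl's law."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Alternative 1: FPSQR is 20% of execution time and becomes 10x faster.
speedup_fpsqr = amdahl_speedup(fraction_enhanced=0.2, speedup_enhanced=10.0)

# Alternative 2: all FP instructions are 50% of execution time and become 1.6x faster.
speedup_fp = amdahl_speedup(fraction_enhanced=0.5, speedup_enhanced=1.6)

print("FPSQR hardware: %.2f" % speedup_fpsqr)
print("All FP faster:  %.2f" % speedup_fp)
```

With these numbers, speeding up all FP instructions comes out slightly ahead (about 1.23 versus 1.22), because it applies to a larger fraction of the execution time.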
36. CPU performance equation
CPU time = CPU clock cycles for a program x Clock cycle time
= CPU clock cycles for a program / Clock rate
37. Example
A program runs in 10 s on computer A, which has a
400 MHz clock. A new machine B, which
could run the same program in 6 s, has to be
designed. Further, B will need 1.2 times as
many clock cycles as A for this program.
What should be the clock rate of B?
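This example can be worked with the CPU performance equation, CPU time = clock cycles / clock rate. The sketch below first derives the cycle count on A and then solves for B's clock rate; the variable names are hypothetical.

```python
# Given values from the example.
time_a_s = 10.0          # run time on machine A
clock_rate_a_hz = 400e6  # 400 MHz
time_b_s = 6.0           # target run time on machine B
cycle_ratio_b_to_a = 1.2 # B needs 1.2 times as many cycles as A

# CPU time = cycles / clock rate, so cycles = CPU time x clock rate.
cycles_a = time_a_s * clock_rate_a_hz
cycles_b = cycle_ratio_b_to_a * cycles_a

# Rearranged for B: clock rate = cycles / CPU time.
clock_rate_b_hz = cycles_b / time_b_s
print("clock rate of B = %.0f MHz" % (clock_rate_b_hz / 1e6))
```

Machine A executes 4 x 10^9 cycles, so B must execute 4.8 x 10^9 cycles in 6 s, which requires a clock of about 800 MHz.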
39. CPU Clock Cycles
CPI (clock cycles per instruction)
average no. of clock cycles each instruction takes to
execute
IC (instruction count)
no. of instructions executed in the program
CPU clock cycles = CPI x IC
Note: CPI can be used to compare two different
implementations of the same instruction set architecture
(since the IC required for a program is the same)
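Combining CPU clock cycles = CPI x IC with the CPU performance equation gives CPU time = CPI x IC / clock rate. The sketch below compares two implementations of the same instruction set this way. All machine figures here are hypothetical, invented only to illustrate the calculation, and the function name is an assumption.

```python
def cpu_time_s(cpi, instruction_count, clock_rate_hz):
    """CPU time = (CPI x IC) / clock rate."""
    clock_cycles = cpi * instruction_count
    return clock_cycles / clock_rate_hz


# Hypothetical program and machines (same ISA, so the same IC on both).
instruction_count = 1e9
time_m1 = cpu_time_s(cpi=2.0, instruction_count=instruction_count, clock_rate_hz=500e6)
time_m2 = cpu_time_s(cpi=1.2, instruction_count=instruction_count, clock_rate_hz=400e6)

print("M1: %.1f s, M2: %.1f s" % (time_m1, time_m2))
print("M2 is %.2f times faster than M1" % (time_m1 / time_m2))
```

In this made-up case the machine with the slower clock still wins because its lower CPI more than compensates.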
40. Example
• Consider two implementations of the same instruction set
architecture. For a certain program, details of the time
measurements on the two machines are given below
• Which machine is faster for this program, and by how
much?
42. Measuring components of CPU performance equation
• CPU Time: by running the program
• Clock Cycle Time: published in documentation
• IC: by software tools or a simulator of the architecture
(more difficult to obtain)
• CPI: by simulation of an implementation (more
difficult to obtain)
43. CPU clock cycles
Suppose there are n different types of instruction
Let
IC_i – No. of times instruction i is executed in a program
CPI_i – Avg. no. of clock cycles for instruction i
Then
CPU clock cycles = sum over i = 1..n of (CPI_i x IC_i)
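With these definitions, the total cycle count is the sum of CPI_i x IC_i over all instruction types, and the overall CPI is that total divided by the total instruction count. A minimal sketch follows; the instruction mix is entirely hypothetical and the function name is an assumption.

```python
def total_clock_cycles(instruction_mix):
    """Sum CPI_i x IC_i over all instruction types.

    instruction_mix maps an instruction type to a (CPI_i, IC_i) pair.
    """
    return sum(cpi * count for cpi, count in instruction_mix.values())


# Hypothetical mix: type -> (average cycles per instruction, times executed).
mix = {
    "alu": (1, 500),
    "load_store": (2, 300),
    "branch": (3, 200),
}

cycles = total_clock_cycles(mix)
total_instructions = sum(count for _, count in mix.values())
average_cpi = cycles / total_instructions
print("cycles = %d, average CPI = %.2f" % (cycles, average_cpi))
```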
44. Example
Suppose we have made the following measurements:
– Frequency of FP operations (other than FPSQR) = 25%
– Average CPI of FP operations = 4.0
– Average CPI of other instructions = 1.33
– Frequency of FPSQR = 2%
– CPI of FPSQR = 20
Design alternatives:
1. Decrease the CPI of FPSQR to 2
2. Decrease the average CPI of all FP operations to 2.5
Compare these two design alternatives using the CPU performance
equation
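One possible way to work this example is to compute the average CPI for the original design and for each alternative; since the instruction count and clock rate do not change, the speedup is just the ratio of CPIs. The sketch below rests on two loud assumptions that the example does not spell out: the original CPI is taken as a 25% / 75% weighted average of FP and other instructions, and alternative 1 is modelled by subtracting the cycles saved on the 2% FPSQR instructions from that average.

```python
# Measurements from the example.
freq_fp, cpi_fp = 0.25, 4.0
cpi_other = 1.33
freq_fpsqr, cpi_fpsqr_old = 0.02, 20.0

# Assumption: everything that is not an FP operation counts as "other" (75%).
freq_other = 1.0 - freq_fp

# Original average CPI (weighted by instruction frequency).
cpi_original = freq_fp * cpi_fp + freq_other * cpi_other

# Alternative 1 (assumed model): FPSQR CPI drops from 20 to 2,
# saving 18 cycles on 2% of the instructions.
cpi_alt1 = cpi_original - freq_fpsqr * (cpi_fpsqr_old - 2.0)

# Alternative 2: average CPI of all FP operations drops from 4.0 to 2.5.
cpi_alt2 = freq_other * cpi_other + freq_fp * 2.5

# Same IC and clock rate, so speedup = CPI_original / CPI_new.
speedup_alt1 = cpi_original / cpi_alt1
speedup_alt2 = cpi_original / cpi_alt2
print("alt 1: CPI %.2f, speedup %.2f" % (cpi_alt1, speedup_alt1))
print("alt 2: CPI %.2f, speedup %.2f" % (cpi_alt2, speedup_alt2))
```

Under these assumptions, improving all FP operations gives the slightly lower CPI and therefore the slightly better speedup.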
46. MIPS as a performance measure
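MIPS (million instructions per second) is commonly defined as instruction count / (execution time x 10^6), which is equivalent to clock rate / (CPI x 10^6). The sketch below evaluates both forms; the machine figures are hypothetical and the function names are assumptions.

```python
def mips_from_time(instruction_count, execution_time_s):
    """MIPS = IC / (execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def mips_from_cpi(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6), equivalent to the form above."""
    return clock_rate_hz / (cpi * 1e6)


# Hypothetical machine: 400 MHz clock, CPI of 2, running 8e8 instructions.
# Execution time = CPI x IC / clock rate = 2 x 8e8 / 400e6 = 4 s.
mips_a = mips_from_time(instruction_count=8e8, execution_time_s=4.0)
mips_b = mips_from_cpi(clock_rate_hz=400e6, cpi=2.0)
print(mips_a, mips_b)
```

Both forms give the same rating because execution time is itself CPI x IC / clock rate.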
47. Problems
MIPS as a performance measure
• MIPS is dependent on the instruction set
– difficult to compare MIPS of computers with
different instruction sets
• MIPS can vary inversely with performance
48. MFLOPS as a performance measure
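MFLOPS (million floating-point operations per second) is commonly defined as the number of floating-point operations in a program / (execution time x 10^6). A minimal sketch follows; the operation count and run time are hypothetical and the function name is an assumption.

```python
def mflops(fp_operation_count, execution_time_s):
    """MFLOPS = number of floating-point operations / (execution time x 10^6)."""
    return fp_operation_count / (execution_time_s * 1e6)


# Hypothetical program: 50 million FP operations completed in 2 s.
rating = mflops(fp_operation_count=50e6, execution_time_s=2.0)
print("%.1f MFLOPS" % rating)
```

Unlike MIPS, this counts operations rather than instructions, so it only applies to floating-point work.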
49. Problems
MFLOPS as a performance measure
• MFLOPS is not dependable across machines
– The Cray C90 has no divide instruction while the Pentium
has one
• MFLOPS depends on the mixture of fast and
slow floating-point operations
– add (fast) and divide (slow) operations