L07_performance and cost in advanced hardware- computer architecture.pptx
1. COE 585:
Advanced Computer Architecture
Lecture 6: Performance and Cost
Instructor: K. O. Boateng
Semester 1, 2023/24
Dept. of Comp. Engg., KNUST, Kumasi
2. Performance and Cost
Which of the following airplanes has the best
performance?
How much faster is the Concorde vs. the 747
How much bigger is the 747 vs. DC-8?
3. Performance and Cost
Which computer is fastest?
Not so simple
– Scientific simulation – FP performance
– Program development – Integer performance
– Database workload – Memory, I/O
4. Performance of Computers
Want to buy the fastest computer for what
you want to do?
– Workload is all-important
– Correct measurement and analysis
Want to design the fastest computer for
what the customer wants to pay?
– Cost is an important criterion
5. Performance is the key to understanding underlying motivation for the
hardware and its organization
Measure, report, and summarize performance to enable users to
– make intelligent choices
– see through the marketing hype!
Why is some hardware better than others for different programs?
What factors of system performance are hardware related?
(e.g., do we need a new machine, or a new operating system?)
How does the machine's instruction set affect performance?
Performance
6. Outline of the rest
Time and performance
Formulation for Performance Evaluation
MIPS and MFLOPS
Amdahl’s law
7. Response Time (elapsed time, latency):
– how long does it take for my job to run?
– how long does it take to execute (start to
finish) my job?
– how long must I wait for the database query?
Throughput:
– how many jobs can the machine run at once?
– what is the average execution rate?
– how much work is getting done?
If we upgrade a machine with a new processor what do we increase?
If we add a new machine to the lab what do we increase?
Computer Performance: TIME,
TIME, TIME!!!
Individual user
concerns…
Systems manager
concerns…
8. Defining Performance
What is important to whom?
Computer system user
– Minimize elapsed time for program =
time_end – time_start
– Called response time
Computer center manager
– Maximize completion rate = #jobs/second
– Called throughput
9. Response Time vs. Throughput
Is throughput = 1/av. response time?
– Only if NO overlap
– Otherwise, throughput > 1/av. response time
– E.g. a lunch buffet – assume 5 entrees
– Each person takes 2 minutes/entrée
– Throughput is 1 person every 2 minutes
– BUT time to fill up tray is 10 minutes
– Why and what would the throughput be otherwise?
5 people simultaneously filling tray (overlap)
Without overlap, throughput = 1/10
10. What is Performance for us?
For computer architects
– CPU time = time spent running a program
Intuitively, bigger should be faster, so:
– Performance = 1/X time, where X is response,
CPU execution, etc.
Elapsed time = CPU time + I/O wait
We will concentrate on CPU time
11. Elapsed Time
– counts everything (disk and memory accesses, waiting for I/O, running
other programs, etc.) from start to finish
– a useful number, but often not good for comparison purposes
elapsed time = CPU time + wait time (I/O, other programs, etc.)
CPU time
– doesn't count waiting for I/O or time spent running other programs
– can be divided into user CPU time and system CPU time (OS calls)
CPU time = user CPU time + system CPU time
elapsed time = user CPU time + system CPU time + wait time
Our focus: user CPU time (CPU execution time or, simply, execution
time)
– time spent executing the lines of code that are in our program
Execution Time
12. For some program running on machine X:
PerformanceX = 1 / Execution timeX
X is n times faster than Y means:
PerformanceX / PerformanceY = n
Definition of Performance
13. Improve Performance
Improve (a) response time or (b)
throughput?
– Faster CPU
Helps both (a) and (b)
– Add more CPUs
Helps (b) and perhaps (a) due to less queueing
14. Performance Comparison
Machine A is n times faster than machine B iff
perf(A)/perf(B) = time(B)/time(A) = n
Machine A is x% faster than machine B iff
– perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
E.g. time(A) = 10s, time(B) = 15s
– 15/10 = 1.5 => A is 1.5 times faster than B
– 15/10 = 1.5 => A is 50% faster than B
15. Breaking Down Performance
A program is broken into instructions
– hardware is aware of instructions, not programs
At lower level, hardware breaks instructions into
cycles
– Lower level state machines change state every cycle
For example:
– 1GHz Snapdragon runs 1000M cycles/sec, 1 cycle =
1ns
– 2.5GHz Core i7 runs 2.5G cycles/sec, 1 cycle = 0.40ns
16. Clock Cycles
Instead of reporting execution time in seconds, we often use cycles. In
modern computers hardware events progress cycle by cycle: in other
words, each event, e.g., multiplication, addition, etc., is a sequence of
cycles
Clock ticks indicate start and end of cycles:
cycle time = time between ticks = seconds per cycle
clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec, 1
MHz. = 106 cycles/sec)
Example: A 200 Mhz. clock has a
cycle time
time
seconds
program
cycles
program
seconds
cycle
1
200 106
109 5 nanoseconds
cycle
tick
tick
17. Performance Equation I
So, to improve performance one can either:
– reduce the number of cycles for a program, or
– reduce the clock cycle time, or, equivalently,
– increase the clock rate
seconds
program
cycles
program
seconds
cycle
CPU execution time CPU clock cycles Clock cycle time
for a program for a program
=
equivalently
18. Could assume that # of cycles = # of instructions
time
1st
instruction
2nd
instruction
3rd
instruction
4th
5th
6th
...
How many cycles are required
for a program?
This assumption is incorrect! Because:
Different instructions take different amounts of time (cycles)
Why…?
19. Multiplication takes more time than addition
Floating point operations take longer than integer ones
Accessing memory takes more time than accessing registers
Important point: changing the cycle time often changes the number of
cycles required for various instructions because it means changing the
hardware design.
time
How many cycles are required
for a program?
20. A given program will require:
– some number of instructions (machine instructions)
– some number of cycles
– some number of seconds
We have a vocabulary that relates these quantities:
– cycle time (seconds per cycle)
– clock rate (cycles per second)
– (average) CPI (cycles per instruction)
a floating point intensive application might have a higher average CPI
– MIPS (millions of instructions per second)
this would be higher for a program using simple instructions
Terminology
21. Performance Equation II
CPU execution time Instruction count average CPI Clock cycle time
for a program for a program
Derive the above equation from Performance Equation I
=
22. Performance Factors
Processor Performance = ---------------
Time
Program
Architecture --> Implementation --> Realization
Compiler Designer Processor Designer Chip Designer
Instructions Cycles
Program Instruction
Time
Cycle
(code size)
= X X
(CPI) (cycle time)
23. Performance Factors
Instructions/Program
– Instructions executed, not static code size
– Determined by algorithm, compiler, ISA
Cycles/Instruction
– Determined by ISA and CPU organization
– Overlap among instructions reduces this factor
Time/cycle
– Determined by technology, organization, clever
circuit design
24. Other Metrics
MIPS and MFLOPS
MIPS = instruction count/(execution time x 106)
= clock rate/(CPI x 106)
But MIPS has serious shortcomings
25. Problems with MIPS
E.g. without floating-point hardware, an FP op
may take 50 single-cycle instructions at a clock
rate of say 1MHz
With FP hardware, only one 2-cycle instruction
at the same clock rate
Thus, adding FP hardware:
– Instructions/program
decreases
– CPI increases
– Total execution time decreases
BUT, MIPS gets worse!
50 => 1
50/50=1 => 2/1=2
50x1/1M => 1x2/1M
1 MIPS => 0.5 MIPS
50 /(50 x 10-6 x 106)
=1 MIPS
1/(2 x 10-6 x 106)
=0.5 MIPS
26. Problems with MIPS
Ignores program
Usually used to quote peak performance
– Ideal conditions => guaranteed not to exceed!
When is MIPS ok?
– Same compiler, same ISA
– E.g. same binary running on AMD Phenom,
Intel Core i7
– Why? Instr/program is constant and can be
ignored
27. Other Metrics
MFLOPS = FP ops in program/(execution time x 106)
Assuming FP ops independent of compiler and
ISA
– Often safe for numeric codes: matrix size determines
# of FP ops/program
– However, not always safe:
Missing instructions (e.g. FP divide)
Optimizing compilers
Relative MIPS and normalized MFLOPS
– Adds to confusion
Use ONLY Time
28. Our Goal
Minimize time which is the product, NOT
isolated factors
Common error is to miss factors while
devising optimizations
– E.g. ISA change to decrease instruction count
– BUT leads to CPU organization which makes
clock slower
Bottom line: factors are inter-related
29. Example 1
Machine A: clock 1ns, CPI 2.0, for program x
Machine B: clock 2ns, CPI 1.2, for program x
Which is faster and how much?
Time/Program = instr/program x cycles/instr x sec/cycle
Time(A) = N x 2.0 x 1 = 2N
Time(B) = N x 1.2 x 2 = 2.4N
Compare: Time(B)/Time(A) = 2.4N/2N = 1.2
So, Machine A is 20% faster than Machine B for
this program
30. Example 2
Keep clock(A) @ 1ns and clock(B) @2ns
For equal performance, if CPI(B)=1.2, what
is CPI(A)?
Time(B)/Time(A) = 1 = (Nx2x1.2)/(Nx1xCPI(A))
CPI(A) = 2.4
31. Example 3
Keep CPI(A)=2.0 and CPI(B)=1.2
For equal performance, if clock(B)=2ns,
what is clock(A)?
Time(B)/Time(A) = 1 = (N x 2.0 x clock(A))/(N x 1.2 x 2)
clock(A) = 1.2ns
32. Which Programs
Execution time of what program?
Best case – you always run the same set of
programs
– Port them and time the whole workload
In reality, use benchmarks
– Programs chosen to measure performance
– Predict performance of actual workload
– Saves effort and money
33. Performance best determined by running a real application
– use programs typical of expected workload
– or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
Small benchmarks
– nice for architects and designers
– easy to standardize
– can be abused!
Benchmark suites
– Perfect Club: set of application codes
– Livermore Loops: 24 loop kernels
– Linpack: linear algebra package
– SPEC: mix of code from industry organization
Benchmarks
34. SPEC (System Performance
Evaluation Corporation)
Sponsored by industry but independent and self-managed – trusted
by code developers and machine vendors
Clear guides for testing, see www.spec.org
Regular updates (benchmarks are dropped and new ones added
periodically according to relevance)
Specialized benchmarks for particular classes of applications
Can still be abused…, by selective optimization!
35. SPEC History
First Round: SPEC CPU89
– 10 programs yielding a single number
Second Round: SPEC CPU92
– SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating point
programs)
– compiler flags can be set differently for different programs
Third Round: SPEC CPU95
– new set of programs: SPEC CINT95 (8 integer programs) and SPEC
CFP95 (10 floating point)
– single flag setting for all programs
Fourth Round: SPEC CPU2000
– new set of programs: SPEC CINT2000 (12 integer programs) and
SPEC CFP2000 (14 floating point)
– single flag setting for all programs
– programs in C, C++, Fortran 77, and Fortran 90
36. Benchmarks: SPEC2000
12 integer and 14 floating-point programs
– Sun Ultra-5 300MHz reference machine has
score of 100
37. Benchmarks: SPEC CINT2000
Benchmark Description
164.gzip Compression
175.vpr FPGA place and route
176.gcc C compiler
181.mcf Combinatorial optimization
186.crafty Chess
197.parser Word processing, grammatical analysis
252.eon Visualization (ray tracing)
253.perlbmk PERL script execution
254.gap Group theory interpreter
255.vortex Object-oriented database
256.bzip2 Compression
300.twolf Place and route simulator
38. Benchmarks: SPEC CFP2000
Benchmark Description
168.wupwise Physics/Quantum Chromodynamics
171.swim Shallow water modeling
172.mgrid Multi-grid solver: 3D potential field
173.applu Parabolic/elliptic PDE
177.mesa 3-D graphics library
178.galgel Computational Fluid Dynamics
179.art Image Recognition/Neural Networks
183.equake Seismic Wave Propagation Simulation
187.facerec Image processing: face recognition
188.ammp Computational chemistry
189.lucas Number theory/primality testing
191.fma3d Finite-element Crash Simulation
200.sixtrack High energy nuclear physics accelerator design
301.apsi Meteorology: Pollutant distribution
39. SPEC CPU2000 reporting
Refer SPEC website www.spec.org for documentation
Single number result – geometric mean of normalized ratios for each
code in the suite
Report precise description of machine
Report compiler flag setting
40. Benchmark Pitfalls
Benchmark not representative
– Your workload is I/O bound, SPEC is useless
Benchmark is too old
– Benchmarks age poorly; benchmarketing
pressure causes vendors to optimize
compiler/hardware/software to benchmarks
– Need to be periodically refreshed
42. Amdahl’s Law: Quantitative
Principles of Computer Design
Amdahl’s Law:
The performance gain from improving
some portion of a computer is calculated
by:
Speedup = Performance for entire task using the enhancement
Performance for the entire task without using the enhancement
or Speedup = Execution time without the enhancement
Execution time for entire task using the enhancement
Here: Task = Program Recall: Performance = 1 /Execution Time
i.e using some enhancement
43. Performance Enhancement
Calculations: Amdahl's Law
The performance enhancement possible due to a given design
improvement is limited by the amount that the improved feature is used
Amdahl’s Law:
Performance improvement or speedup due to enhancement E:
Execution Time without E Performance with E
Speedup(E) = -------------------------------------- = ---------------------------------
Execution Time with E Performance without E
– Suppose that enhancement E accelerates a fraction F of the execution
time by a factor S and the remainder of the time is unaffected then:
Execution Time with E = ((1-F) + F/S) X Execution Time without E
Hence speedup is given by:
Execution Time without E 1
Speedup(E) = --------------------------------------------------------- = --------------------
((1 - F) + F/S) X Execution Time without E (1 - F) + F/S
F (Fraction of execution time enhanced) refers
to original execution time before the enhancement is applied
original
44. Pictorial Depiction of
Amdahl’s Law
Before:
Execution Time without enhancement E: (Before enhancement is applied)
After:
Execution Time with enhancement E:
Enhancement E accelerates fraction F of original execution time by a factor of S
Unaffected fraction: (1- F) Affected fraction: F
Unaffected fraction: (1- F) F/S
Unchanged
Execution Time without enhancement E 1
Speedup(E) = ------------------------------------------------------ = ------------------
Execution Time with enhancement E (1 - F) + F/S
• shown normalized to 1 = (1-F) + F =1
What if the fraction given is
after the enhancement has been applied?
How would you solve the problem?
(i.e find expression for speedup)
45. Example 1
For a RISC machine with the following instruction mix:
Op Freq Cycles CPI(i) % Time
ALU 50% 1 .5 23%
Load 20% 5 1.0 45%
Store 10% 3 .3 14%
Branch 20% 2 .4 18%
If a CPU design enhancement improves the CPI of load instructions from
5 to 2, what is the resulting performance improvement from this
enhancement:
Fraction enhanced = F = 45% or .45
Unaffected fraction = 1- F = 100% - 45% = 55% or .55
Factor of enhancement = S = 5/2 = 2.5
Using Amdahl’s Law:
1 1
Speedup(E) = ------------------ = --------------------- = 1.37
(1 - F) + F/S .55 + .45/2.5
CPI = 2.2
46. Example 1 Alternative solution
Op Freq Cycles CPI(i) % Time
ALU 50% 1 .5 23%
Load 20% 5 1.0 45%
Store 10% 3 .3 14%
Branch 20% 2 .4 18%
If a CPU design enhancement improves the CPI of load instructions from
5 to 2, what is the resulting performance improvement from this
enhancement:
Old CPI = 2.2
New CPI = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6
Original Execution Time Instruction count x old CPI x clock cycle
Speedup(E) = ----------------------------------- = ----------------------------------------------------------------
New Execution Time Instruction count x new CPI x clock cycle
old CPI 2.2
= ------------ = --------- = 1.37
new CPI 1.6
Which is the same speedup obtained from Amdahl’s Law in the first solution.
CPI = 2.2
T = I x CPI x C
New CPI of load is now 2 instead of 5
47. Example 2
A program runs in 100 seconds on a machine with multiply operations
responsible for 80 seconds of this time. By how much must the speed
of multiplication be improved to make the program four times faster?
100
Desired speedup = 4 = -----------------------------------------------------
Execution Time with enhancement
Execution time with enhancement = 100/4 = 25 seconds
25 seconds = (100 - 80 seconds) + 80 seconds / S
25 seconds = 20 seconds + 80 seconds / S
5 = 80 seconds / S
S = 80/5 = 16
Alternatively, it can also be solved by finding enhanced fraction of execution time:
F = 80/100 = .8
and then solving Amdahl’s speedup equation for desired enhancement factor S
Hence multiplication should be 16 times
faster to get an overall speedup of 4.
1 1 1
Speedup(E) = ------------------ = 4 = ----------------- = ---------------
(1 - F) + F/S (1 - .8) + .8/S .2 + .8/s
Solving for S gives S= 16
Machine = CPU
48. Example 3
For the previous example with a program running in 100 seconds on a
machine with multiply operations responsible for 80 seconds of this
time. By how much must the speed of multiplication be improved to
make the program five times faster?
100
Desired speedup = 5 = -----------------------------------------------------
Execution Time with enhancement
Execution time with enhancement = 100/5 = 20 seconds
20 seconds = (100 - 80 seconds) + 80 seconds / s
20 seconds = 20 seconds + 80 seconds / s
0 = 80 seconds / s
No amount of multiplication speed improvement can achieve this.
49. Multiple Enhancements
Suppose that enhancement Ei accelerates a fraction Fi of the
original execution time by a factor Si and the remainder of the
time is unaffected then:
i i
i
i
i
X
S
F
F
Speedup
Time
Execution
Original
)
1
Time
Execution
Original
)
((
i i
i
i
i
S
F
F
Speedup
)
( )
1
1
(
Note: All fractions Fi refer to original execution time before the
enhancements are applied.
Unaffected fraction
i = 1, 2, …. n
What if the fractions given are
after the enhancements were applied?
How would you solve the problem?
(i.e find expression for speedup)
n enhancements each affecting a different portion of execution time
50. Multiple Enhancements Example
Three CPU performance enhancements are proposed with the following
speedups and percentage of the code execution time affected:
Speedup1 = S1 = 10 Percentage1 = F1 = 20%
Speedup2 = S2 = 15 Percentage1 = F2 = 15%
Speedup3 = S3 = 30 Percentage1 = F3 = 10%
While all three enhancements are in place in the new design, each
enhancement affects a different portion of the code and only one
enhancement can be used at a time.
What is the resulting overall speedup?
Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30)]
= 1 / [ .55 + .0333 ]
= 1 / .5833 = 1.71
i i
i
i
i
S
F
F
Speedup
)
( )
1
1
(
51. Pictorial Depiction of Example
Before:
Execution Time with no enhancements: 1
After:
Execution Time with enhancements: .55 + .02 + .01 + .00333 = .5833
Speedup = 1 / .5833 = 1.71
Note: All fractions Fi refer to original execution time.
Unaffected, fraction: .55
Unchanged
Unaffected, fraction: .55 F1 = .2 F2 = .15 F3 = .1
S1 = 10 S2 = 15 S3 = 30
/ 10 / 30
/ 15
What if the fractions given are
after the enhancements were applied?
How would you solve the problem?
i.e normalized to 1
52. Summary
Time and performance: Machine A n times
faster than Machine B
– Iff Time(B)/Time(A) = n
Performance = Time/program =
Instructions Cycles
Program Instruction
Time
Cycle
(code size)
= X X
(CPI) (cycle time)
53. Summary Cont’d
Other Metrics: MIPS and MFLOPS
– Beware of peak and omitted details
Benchmarks: SPEC2000
Amdahl’s Law:
s
f
f
Speedup
1
1