L07_performance and cost in advanced hardware- computer architecture.pptx

COE 585:
Advanced Computer Architecture
Lecture 6: Performance and Cost
Instructor: K. O. Boateng
Semester 1, 2023/24
Dept. of Comp. Engg., KNUST, Kumasi

Performance and Cost
 Which of the following airplanes has the best
performance?
 How much faster is the Concorde vs. the 747
 How much bigger is the 747 vs. DC-8?

Performance and Cost
 Which computer is fastest?
 Not so simple
– Scientific simulation – FP performance
– Program development – Integer performance
– Database workload – Memory, I/O

Performance of Computers
 Want to buy the fastest computer for what
you want to do?
– Workload is all-important
– Correct measurement and analysis
 Want to design the fastest computer for
what the customer wants to pay?
– Cost is an important criterion

 Performance is the key to understanding underlying motivation for the
hardware and its organization
 Measure, report, and summarize performance to enable users to
– make intelligent choices
– see through the marketing hype!
 Why is some hardware better than others for different programs?
 What factors of system performance are hardware related?
(e.g., do we need a new machine, or a new operating system?)
 How does the machine's instruction set affect performance?
Performance

Outline of the rest
 Time and performance
 Formulation for Performance Evaluation
 MIPS and MFLOPS
 Amdahl’s law

 Response Time (elapsed time, latency):
– how long does it take for my job to run?
– how long does it take to execute (start to
finish) my job?
– how long must I wait for the database query?
 Throughput:
– how many jobs can the machine run at once?
– what is the average execution rate?
– how much work is getting done?
 If we upgrade a machine with a new processor what do we increase?
 If we add a new machine to the lab what do we increase?
Computer Performance: TIME,
TIME, TIME!!!
Individual user
concerns…
Systems manager
concerns…

Defining Performance
 What is important to whom?
 Computer system user
– Minimize elapsed time for program =
time_end – time_start
– Called response time
 Computer center manager
– Maximize completion rate = #jobs/second
– Called throughput

Response Time vs. Throughput
 Is throughput = 1/av. response time?
– Only if NO overlap
– Otherwise, throughput > 1/av. response time
– E.g. a lunch buffet – assume 5 entrees
– Each person takes 2 minutes/entrée
– Throughput is 1 person every 2 minutes
– BUT time to fill up tray is 10 minutes
– Why and what would the throughput be otherwise?
 5 people simultaneously filling tray (overlap)
 Without overlap, throughput = 1/10

What is Performance for us?
 For computer architects
– CPU time = time spent running a program
 Intuitively, bigger should be faster, so:
– Performance = 1/X time, where X is response,
CPU execution, etc.
 Elapsed time = CPU time + I/O wait
 We will concentrate on CPU time

 Elapsed Time
– counts everything (disk and memory accesses, waiting for I/O, running
other programs, etc.) from start to finish
– a useful number, but often not good for comparison purposes
elapsed time = CPU time + wait time (I/O, other programs, etc.)
 CPU time
– doesn't count waiting for I/O or time spent running other programs
– can be divided into user CPU time and system CPU time (OS calls)
CPU time = user CPU time + system CPU time
 elapsed time = user CPU time + system CPU time + wait time
 Our focus: user CPU time (CPU execution time or, simply, execution
time)
– time spent executing the lines of code that are in our program
Execution Time

 For some program running on machine X:
PerformanceX = 1 / Execution timeX
 X is n times faster than Y means:
PerformanceX / PerformanceY = n
Definition of Performance

Improve Performance
 Improve (a) response time or (b)
throughput?
– Faster CPU
 Helps both (a) and (b)
– Add more CPUs
 Helps (b) and perhaps (a) due to less queueing

Performance Comparison
 Machine A is n times faster than machine B iff
perf(A)/perf(B) = time(B)/time(A) = n
 Machine A is x% faster than machine B iff
– perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
 E.g. time(A) = 10s, time(B) = 15s
– 15/10 = 1.5 => A is 1.5 times faster than B
– 15/10 = 1.5 => A is 50% faster than B

Breaking Down Performance
 A program is broken into instructions
– hardware is aware of instructions, not programs
 At lower level, hardware breaks instructions into
cycles
– Lower level state machines change state every cycle
 For example:
– 1GHz Snapdragon runs 1000M cycles/sec, 1 cycle =
1ns
– 2.5GHz Core i7 runs 2.5G cycles/sec, 1 cycle = 0.40ns

Clock Cycles
 Instead of reporting execution time in seconds, we often use cycles. In
modern computers hardware events progress cycle by cycle: in other
words, each event, e.g., multiplication, addition, etc., is a sequence of
cycles
 Clock ticks indicate start and end of cycles:
 cycle time = time between ticks = seconds per cycle
 clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec, 1
MHz. = 106 cycles/sec)
 Example: A 200 Mhz. clock has a
cycle time
time
seconds
program

cycles
program

seconds
cycle
1
200 106
109  5 nanoseconds
cycle
tick
tick

Performance Equation I
 So, to improve performance one can either:
– reduce the number of cycles for a program, or
– reduce the clock cycle time, or, equivalently,
– increase the clock rate
seconds
program

cycles
program

seconds
cycle
CPU execution time CPU clock cycles Clock cycle time
for a program for a program
=

equivalently

 Could assume that # of cycles = # of instructions
time
1st
instruction
2nd
instruction
3rd
instruction
4th
5th
6th
...
How many cycles are required
for a program?
 This assumption is incorrect! Because:
 Different instructions take different amounts of time (cycles)
 Why…?

 Multiplication takes more time than addition
 Floating point operations take longer than integer ones
 Accessing memory takes more time than accessing registers
 Important point: changing the cycle time often changes the number of
cycles required for various instructions because it means changing the
hardware design.
time
How many cycles are required
for a program?

 A given program will require:
– some number of instructions (machine instructions)
– some number of cycles
– some number of seconds
 We have a vocabulary that relates these quantities:
– cycle time (seconds per cycle)
– clock rate (cycles per second)
– (average) CPI (cycles per instruction)
 a floating point intensive application might have a higher average CPI
– MIPS (millions of instructions per second)
 this would be higher for a program using simple instructions
Terminology

Performance Equation II
CPU execution time Instruction count average CPI Clock cycle time
for a program for a program
 Derive the above equation from Performance Equation I
=
 

Performance Factors
Processor Performance = ---------------
Time
Program
Architecture --> Implementation --> Realization
Compiler Designer Processor Designer Chip Designer
Instructions Cycles
Program Instruction
Time
Cycle
(code size)
= X X
(CPI) (cycle time)

Performance Factors
 Instructions/Program
– Instructions executed, not static code size
– Determined by algorithm, compiler, ISA
 Cycles/Instruction
– Determined by ISA and CPU organization
– Overlap among instructions reduces this factor
 Time/cycle
– Determined by technology, organization, clever
circuit design

Other Metrics
 MIPS and MFLOPS
 MIPS = instruction count/(execution time x 106)
= clock rate/(CPI x 106)
 But MIPS has serious shortcomings

Problems with MIPS
 E.g. without floating-point hardware, an FP op
may take 50 single-cycle instructions at a clock
rate of say 1MHz
 With FP hardware, only one 2-cycle instruction
at the same clock rate
 Thus, adding FP hardware:
– Instructions/program
decreases
– CPI increases
– Total execution time decreases
 BUT, MIPS gets worse!
50 => 1
50/50=1 => 2/1=2
50x1/1M => 1x2/1M
1 MIPS => 0.5 MIPS
50 /(50 x 10-6 x 106)
=1 MIPS
1/(2 x 10-6 x 106)
=0.5 MIPS

Problems with MIPS
 Ignores program
 Usually used to quote peak performance
– Ideal conditions => guaranteed not to exceed!
 When is MIPS ok?
– Same compiler, same ISA
– E.g. same binary running on AMD Phenom,
Intel Core i7
– Why? Instr/program is constant and can be
ignored

Other Metrics
 MFLOPS = FP ops in program/(execution time x 106)
 Assuming FP ops independent of compiler and
ISA
– Often safe for numeric codes: matrix size determines
# of FP ops/program
– However, not always safe:
 Missing instructions (e.g. FP divide)
 Optimizing compilers
 Relative MIPS and normalized MFLOPS
– Adds to confusion
 Use ONLY Time

Our Goal
 Minimize time which is the product, NOT
isolated factors
 Common error is to miss factors while
devising optimizations
– E.g. ISA change to decrease instruction count
– BUT leads to CPU organization which makes
clock slower
 Bottom line: factors are inter-related

Example 1
 Machine A: clock 1ns, CPI 2.0, for program x
 Machine B: clock 2ns, CPI 1.2, for program x
 Which is faster and how much?
Time/Program = instr/program x cycles/instr x sec/cycle
Time(A) = N x 2.0 x 1 = 2N
Time(B) = N x 1.2 x 2 = 2.4N
Compare: Time(B)/Time(A) = 2.4N/2N = 1.2
 So, Machine A is 20% faster than Machine B for
this program

Example 2
Keep clock(A) @ 1ns and clock(B) @2ns
For equal performance, if CPI(B)=1.2, what
is CPI(A)?
Time(B)/Time(A) = 1 = (Nx2x1.2)/(Nx1xCPI(A))
CPI(A) = 2.4

Example 3
 Keep CPI(A)=2.0 and CPI(B)=1.2
 For equal performance, if clock(B)=2ns,
what is clock(A)?
Time(B)/Time(A) = 1 = (N x 2.0 x clock(A))/(N x 1.2 x 2)
clock(A) = 1.2ns

Which Programs
 Execution time of what program?
 Best case – you always run the same set of
programs
– Port them and time the whole workload
 In reality, use benchmarks
– Programs chosen to measure performance
– Predict performance of actual workload
– Saves effort and money

 Performance best determined by running a real application
– use programs typical of expected workload
– or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
 Small benchmarks
– nice for architects and designers
– easy to standardize
– can be abused!
 Benchmark suites
– Perfect Club: set of application codes
– Livermore Loops: 24 loop kernels
– Linpack: linear algebra package
– SPEC: mix of code from industry organization
Benchmarks

SPEC (System Performance
Evaluation Corporation)
 Sponsored by industry but independent and self-managed – trusted
by code developers and machine vendors
 Clear guides for testing, see www.spec.org
 Regular updates (benchmarks are dropped and new ones added
periodically according to relevance)
 Specialized benchmarks for particular classes of applications
 Can still be abused…, by selective optimization!

SPEC History
 First Round: SPEC CPU89
– 10 programs yielding a single number
 Second Round: SPEC CPU92
– SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating point
programs)
– compiler flags can be set differently for different programs
 Third Round: SPEC CPU95
– new set of programs: SPEC CINT95 (8 integer programs) and SPEC
CFP95 (10 floating point)
– single flag setting for all programs
 Fourth Round: SPEC CPU2000
– new set of programs: SPEC CINT2000 (12 integer programs) and
SPEC CFP2000 (14 floating point)
– single flag setting for all programs
– programs in C, C++, Fortran 77, and Fortran 90

Benchmarks: SPEC2000
 12 integer and 14 floating-point programs
– Sun Ultra-5 300MHz reference machine has
score of 100

Benchmarks: SPEC CINT2000
Benchmark Description
164.gzip Compression
175.vpr FPGA place and route
176.gcc C compiler
181.mcf Combinatorial optimization
186.crafty Chess
197.parser Word processing, grammatical analysis
252.eon Visualization (ray tracing)
253.perlbmk PERL script execution
254.gap Group theory interpreter
255.vortex Object-oriented database
256.bzip2 Compression
300.twolf Place and route simulator

Benchmarks: SPEC CFP2000
Benchmark Description
168.wupwise Physics/Quantum Chromodynamics
171.swim Shallow water modeling
172.mgrid Multi-grid solver: 3D potential field
173.applu Parabolic/elliptic PDE
177.mesa 3-D graphics library
178.galgel Computational Fluid Dynamics
179.art Image Recognition/Neural Networks
183.equake Seismic Wave Propagation Simulation
187.facerec Image processing: face recognition
188.ammp Computational chemistry
189.lucas Number theory/primality testing
191.fma3d Finite-element Crash Simulation
200.sixtrack High energy nuclear physics accelerator design
301.apsi Meteorology: Pollutant distribution

SPEC CPU2000 reporting
 Refer SPEC website www.spec.org for documentation
 Single number result – geometric mean of normalized ratios for each
code in the suite
 Report precise description of machine
 Report compiler flag setting

Benchmark Pitfalls
 Benchmark not representative
– Your workload is I/O bound, SPEC is useless
 Benchmark is too old
– Benchmarks age poorly; benchmarketing
pressure causes vendors to optimize
compiler/hardware/software to benchmarks
– Need to be periodically refreshed

Specialized SPEC Benchmarks
 I/O
 Network
 Graphics
 Java
 Web server
 Transaction processing (databases)

Amdahl’s Law: Quantitative
Principles of Computer Design
 Amdahl’s Law:
The performance gain from improving
some portion of a computer is calculated
by:
Speedup = Performance for entire task using the enhancement
Performance for the entire task without using the enhancement
or Speedup = Execution time without the enhancement
Execution time for entire task using the enhancement
Here: Task = Program Recall: Performance = 1 /Execution Time
i.e using some enhancement

Performance Enhancement
Calculations: Amdahl's Law
 The performance enhancement possible due to a given design
improvement is limited by the amount that the improved feature is used
 Amdahl’s Law:
Performance improvement or speedup due to enhancement E:
Execution Time without E Performance with E
Speedup(E) = -------------------------------------- = ---------------------------------
Execution Time with E Performance without E
– Suppose that enhancement E accelerates a fraction F of the execution
time by a factor S and the remainder of the time is unaffected then:
Execution Time with E = ((1-F) + F/S) X Execution Time without E
Hence speedup is given by:
Execution Time without E 1
Speedup(E) = --------------------------------------------------------- = --------------------
((1 - F) + F/S) X Execution Time without E (1 - F) + F/S
F (Fraction of execution time enhanced) refers
to original execution time before the enhancement is applied
original

Pictorial Depiction of
Amdahl’s Law
Before:
Execution Time without enhancement E: (Before enhancement is applied)
After:
Execution Time with enhancement E:
Enhancement E accelerates fraction F of original execution time by a factor of S
Unaffected fraction: (1- F) Affected fraction: F
Unaffected fraction: (1- F) F/S
Unchanged
Execution Time without enhancement E 1
Speedup(E) = ------------------------------------------------------ = ------------------
Execution Time with enhancement E (1 - F) + F/S
• shown normalized to 1 = (1-F) + F =1
What if the fraction given is
after the enhancement has been applied?
How would you solve the problem?
(i.e find expression for speedup)

Example 1
 For a RISC machine with the following instruction mix:
Op Freq Cycles CPI(i) % Time
ALU 50% 1 .5 23%
Load 20% 5 1.0 45%
Store 10% 3 .3 14%
Branch 20% 2 .4 18%
 If a CPU design enhancement improves the CPI of load instructions from
5 to 2, what is the resulting performance improvement from this
enhancement:
Fraction enhanced = F = 45% or .45
Unaffected fraction = 1- F = 100% - 45% = 55% or .55
Factor of enhancement = S = 5/2 = 2.5
Using Amdahl’s Law:
1 1
Speedup(E) = ------------------ = --------------------- = 1.37
(1 - F) + F/S .55 + .45/2.5
CPI = 2.2

Example 1 Alternative solution
Op Freq Cycles CPI(i) % Time
ALU 50% 1 .5 23%
Load 20% 5 1.0 45%
Store 10% 3 .3 14%
Branch 20% 2 .4 18%
 If a CPU design enhancement improves the CPI of load instructions from
5 to 2, what is the resulting performance improvement from this
enhancement:
Old CPI = 2.2
New CPI = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6
Original Execution Time Instruction count x old CPI x clock cycle
Speedup(E) = ----------------------------------- = ----------------------------------------------------------------
New Execution Time Instruction count x new CPI x clock cycle
old CPI 2.2
= ------------ = --------- = 1.37
new CPI 1.6
Which is the same speedup obtained from Amdahl’s Law in the first solution.
CPI = 2.2
T = I x CPI x C
New CPI of load is now 2 instead of 5

Example 2
 A program runs in 100 seconds on a machine with multiply operations
responsible for 80 seconds of this time. By how much must the speed
of multiplication be improved to make the program four times faster?
100
Desired speedup = 4 = -----------------------------------------------------
Execution Time with enhancement
Execution time with enhancement = 100/4 = 25 seconds
25 seconds = (100 - 80 seconds) + 80 seconds / S
25 seconds = 20 seconds + 80 seconds / S
 5 = 80 seconds / S
 S = 80/5 = 16
Alternatively, it can also be solved by finding enhanced fraction of execution time:
F = 80/100 = .8
and then solving Amdahl’s speedup equation for desired enhancement factor S
Hence multiplication should be 16 times
faster to get an overall speedup of 4.
1 1 1
Speedup(E) = ------------------ = 4 = ----------------- = ---------------
(1 - F) + F/S (1 - .8) + .8/S .2 + .8/s
Solving for S gives S= 16
Machine = CPU

Example 3
 For the previous example with a program running in 100 seconds on a
machine with multiply operations responsible for 80 seconds of this
time. By how much must the speed of multiplication be improved to
make the program five times faster?
100
Desired speedup = 5 = -----------------------------------------------------
Execution Time with enhancement
Execution time with enhancement = 100/5 = 20 seconds
20 seconds = (100 - 80 seconds) + 80 seconds / s
20 seconds = 20 seconds + 80 seconds / s
 0 = 80 seconds / s
No amount of multiplication speed improvement can achieve this.

Multiple Enhancements
 Suppose that enhancement Ei accelerates a fraction Fi of the
original execution time by a factor Si and the remainder of the
time is unaffected then:
 



i i
i
i
i
X
S
F
F
Speedup
Time
Execution
Original
)
1
Time
Execution
Original
)
((
 



i i
i
i
i
S
F
F
Speedup
)
( )
1
1
(
Note: All fractions Fi refer to original execution time before the
enhancements are applied.
Unaffected fraction
i = 1, 2, …. n
What if the fractions given are
after the enhancements were applied?
(i.e find expression for speedup)
n enhancements each affecting a different portion of execution time

Multiple Enhancements Example
 Three CPU performance enhancements are proposed with the following
speedups and percentage of the code execution time affected:
Speedup1 = S1 = 10 Percentage1 = F1 = 20%
 While all three enhancements are in place in the new design, each
enhancement affects a different portion of the code and only one
enhancement can be used at a time.
 What is the resulting overall speedup?
 Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30)]
= 1 / [ .55 + .0333 ]
= 1 / .5833 = 1.71
 



i i
i
i
i
S
F
F
Speedup
)
( )
1
1
(

Pictorial Depiction of Example
Before:
Execution Time with no enhancements: 1
After:
Execution Time with enhancements: .55 + .02 + .01 + .00333 = .5833
Speedup = 1 / .5833 = 1.71
Note: All fractions Fi refer to original execution time.
Unaffected, fraction: .55
Unchanged
Unaffected, fraction: .55 F1 = .2 F2 = .15 F3 = .1
S1 = 10 S2 = 15 S3 = 30
/ 10 / 30
/ 15
What if the fractions given are
after the enhancements were applied?
i.e normalized to 1

Summary
 Time and performance: Machine A n times
faster than Machine B
– Iff Time(B)/Time(A) = n
 Performance = Time/program =
Instructions Cycles
Program Instruction
Time
Cycle
(code size)
= X X
(CPI) (cycle time)

Summary Cont’d
 Other Metrics: MIPS and MFLOPS
– Beware of peak and omitted details
 Benchmarks: SPEC2000
 Amdahl’s Law:
s
f
f
Speedup



1
1

L07_performance and cost in advanced hardware- computer architecture.pptx

Recommended

Recommended

More Related Content

Similar to L07_performance and cost in advanced hardware- computer architecture.pptx

Similar to L07_performance and cost in advanced hardware- computer architecture.pptx (20)

Recently uploaded

Recently uploaded (20)

L07_performance and cost in advanced hardware- computer architecture.pptx