COE 585:
Advanced Computer Architecture
Lecture 6: Performance and Cost
Instructor: K. O. Boateng
Semester 1, 2023/24
Dept. of Comp. Engg., KNUST, Kumasi
Performance and Cost
 Which of the following airplanes has the best
performance?
 How much faster is the Concorde vs. the 747?
 How much bigger is the 747 vs. the DC-8?
Performance and Cost
 Which computer is fastest?
 Not so simple
– Scientific simulation – FP performance
– Program development – Integer performance
– Database workload – Memory, I/O
Performance of Computers
 Want to buy the fastest computer for what
you want to do?
– Workload is all-important
– Correct measurement and analysis
 Want to design the fastest computer for
what the customer wants to pay?
– Cost is an important criterion
 Performance is the key to understanding the underlying motivation for the
hardware and its organization
 Measure, report, and summarize performance to enable users to
– make intelligent choices
– see through the marketing hype!
 Why is some hardware better than others for different programs?
 What factors of system performance are hardware related?
(e.g., do we need a new machine, or a new operating system?)
 How does the machine's instruction set affect performance?
Performance
Outline of the rest
 Time and performance
 Formulation for Performance Evaluation
 MIPS and MFLOPS
 Amdahl’s law
 Response Time (elapsed time, latency):
– how long does it take for my job to run?
– how long does it take to execute (start to
finish) my job?
– how long must I wait for the database query?
 Throughput:
– how many jobs can the machine run at once?
– what is the average execution rate?
– how much work is getting done?
 If we upgrade a machine with a new processor what do we increase?
 If we add a new machine to the lab what do we increase?
Computer Performance: TIME,
TIME, TIME!!!
Individual user
concerns…
Systems manager
concerns…
Defining Performance
 What is important to whom?
 Computer system user
– Minimize elapsed time for program =
time_end – time_start
– Called response time
 Computer center manager
– Maximize completion rate = #jobs/second
– Called throughput
Response Time vs. Throughput
 Is throughput = 1/(average response time)?
– Only if there is NO overlap
– Otherwise, throughput > 1/(average response time)
– E.g. a lunch buffet with 5 entrees
– Each person takes 2 minutes per entrée
– Throughput is 1 person every 2 minutes
– BUT the time to fill up a tray (the response time) is 5 x 2 = 10 minutes
– Why, and what would the throughput be otherwise?
 5 people fill their trays simultaneously (overlap)
 Without overlap, throughput = 1 person every 10 minutes = 1/10 person per minute
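The buffet arithmetic can be sketched in a few lines of Python. The helper name `buffet_metrics` and its parameters are assumptions made for illustration; the sketch also assumes every entrée takes the same time and the line is always full.

```python
def buffet_metrics(entrees: int, minutes_per_entree: float) -> dict:
    """Response time and throughput for the buffet line (hypothetical helper)."""
    # One person visits every entree in turn, so the response time is the sum.
    response_time = entrees * minutes_per_entree
    return {
        "response_time_min": response_time,
        # With overlap, one person leaves the line every entree time.
        "throughput_overlap": 1 / minutes_per_entree,
        # Without overlap, the next person starts only when the line is empty.
        "throughput_no_overlap": 1 / response_time,
    }


if __name__ == "__main__":
    print(buffet_metrics(entrees=5, minutes_per_entree=2))
```

With overlap, throughput (0.5 persons per minute) is five times 1/(response time), which is the point of the slide.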
What is Performance for us?
 For computer architects
– CPU time = time spent running a program
 Intuitively, a bigger performance number should mean a faster machine, so:
– Performance = 1/(X time), where X is response, CPU execution, etc.
 Elapsed time = CPU time + I/O wait
 We will concentrate on CPU time
 Elapsed Time
– counts everything (disk and memory accesses, waiting for I/O, running
other programs, etc.) from start to finish
– a useful number, but often not good for comparison purposes
elapsed time = CPU time + wait time (I/O, other programs, etc.)
 CPU time
– doesn't count waiting for I/O or time spent running other programs
– can be divided into user CPU time and system CPU time (OS calls)
CPU time = user CPU time + system CPU time
 elapsed time = user CPU time + system CPU time + wait time
 Our focus: user CPU time (CPU execution time or, simply, execution
time)
– time spent executing the lines of code that are in our program
Execution Time
 For some program running on machine X:
Performance(X) = 1 / Execution time(X)
 X is n times faster than Y means:
Performance(X) / Performance(Y) = n
Definition of Performance
Improve Performance
 Improve (a) response time or (b)
throughput?
– Faster CPU
 Helps both (a) and (b)
– Add more CPUs
 Helps (b) and perhaps (a) due to less queueing
Performance Comparison
 Machine A is n times faster than machine B iff
perf(A)/perf(B) = time(B)/time(A) = n
 Machine A is x% faster than machine B iff
– perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
 E.g. time(A) = 10s, time(B) = 15s
– 15/10 = 1.5 => A is 1.5 times faster than B
– 15/10 = 1.5 => A is 50% faster than B
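A minimal Python sketch of the two comparison rules, using the 10 s and 15 s figures from the example. The function names are assumptions chosen for illustration.

```python
def times_faster(time_a: float, time_b: float) -> float:
    """n such that A is n times faster than B: time(B) / time(A)."""
    return time_b / time_a


def percent_faster(time_a: float, time_b: float) -> float:
    """x such that A is x% faster than B: time(B) / time(A) = 1 + x/100."""
    return (time_b / time_a - 1) * 100


if __name__ == "__main__":
    print(times_faster(10, 15))    # 1.5
    print(percent_faster(10, 15))  # 50.0
```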
Breaking Down Performance
 A program is broken into instructions
– hardware is aware of instructions, not programs
 At lower level, hardware breaks instructions into
cycles
– Lower level state machines change state every cycle
 For example:
– 1GHz Snapdragon runs 1000M cycles/sec, 1 cycle =
1ns
– 2.5GHz Core i7 runs 2.5G cycles/sec, 1 cycle = 0.40ns
Clock Cycles
 Instead of reporting execution time in seconds, we often use cycles. In
modern computers hardware events progress cycle by cycle: in other
words, each event, e.g., multiplication, addition, etc., is a sequence of
cycles
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$
 Clock ticks indicate start and end of cycles:
[Figure: clock waveform along a time axis; successive ticks mark the start and end of each cycle]
 cycle time = time between ticks = seconds per cycle
 clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec, 1 MHz = 10^6 cycles/sec)
 Example: a 200 MHz clock has a cycle time of
$$\frac{1}{200 \times 10^{6}} \times 10^{9} = 5 \text{ nanoseconds}$$
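The relation between clock rate and cycle time can be checked with a short Python sketch. The function name `cycle_time_ns` is an assumption for illustration; the clock rates are the ones quoted in the slides.

```python
def cycle_time_ns(clock_rate_hz: float) -> float:
    """Clock cycle time in nanoseconds for a clock rate given in Hz."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    # cycle time (s) = 1 / clock rate (Hz); multiply by 1e9 to get nanoseconds
    return 1e9 / clock_rate_hz


if __name__ == "__main__":
    for name, rate in [("200 MHz", 200e6), ("1 GHz", 1e9), ("2.5 GHz", 2.5e9)]:
        print(f"{name}: {cycle_time_ns(rate)} ns per cycle")
```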
Performance Equation I
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$
equivalently
$$\text{CPU execution time for a program} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$
 So, to improve performance one can either:
– reduce the number of cycles for a program, or
– reduce the clock cycle time, or, equivalently,
– increase the clock rate
 Could assume that # of cycles = # of instructions
[Figure: time axis with the 1st, 2nd, 3rd, 4th, 5th, 6th, ... instructions laid out in sequence]
How many cycles are required
for a program?
 This assumption is incorrect! Because:
 Different instructions take different amounts of time (cycles)
 Why…?
 Multiplication takes more time than addition
 Floating point operations take longer than integer ones
 Accessing memory takes more time than accessing registers
 Important point: changing the cycle time often changes the number of
cycles required for various instructions because it means changing the
hardware design.
How many cycles are required
for a program?
 A given program will require:
– some number of instructions (machine instructions)
– some number of cycles
– some number of seconds
 We have a vocabulary that relates these quantities:
– cycle time (seconds per cycle)
– clock rate (cycles per second)
– (average) CPI (cycles per instruction)
 a floating point intensive application might have a higher average CPI
– MIPS (millions of instructions per second)
 this would be higher for a program using simple instructions
Terminology
Performance Equation II
$$\text{CPU execution time for a program} = \text{Instruction count for a program} \times \text{average CPI} \times \text{Clock cycle time}$$
 Derive the above equation from Performance Equation I
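One way to carry out the derivation is sketched below. It assumes that the average CPI is defined as the total number of clock cycles divided by the instruction count, which is consistent with the "cycles per instruction" definition given earlier.

```latex
\begin{align*}
\text{CPU execution time} &= \text{CPU clock cycles} \times \text{Clock cycle time}
  && \text{(Performance Equation I)} \\
\text{average CPI} &= \frac{\text{CPU clock cycles}}{\text{Instruction count}}
  \;\Longrightarrow\;
  \text{CPU clock cycles} = \text{Instruction count} \times \text{average CPI} \\
\text{CPU execution time} &= \text{Instruction count} \times \text{average CPI} \times \text{Clock cycle time}
  && \text{(Performance Equation II)}
\end{align*}
```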
Performance Factors
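$$\text{Processor Performance} = \frac{\text{Time}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{code size}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{cycle time}}$$
Architecture --> Implementation --> Realization
Compiler Designer, Processor Designer, Chip Designer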
Performance Factors
 Instructions/Program
– Instructions executed, not static code size
– Determined by algorithm, compiler, ISA
 Cycles/Instruction
– Determined by ISA and CPU organization
– Overlap among instructions reduces this factor
 Time/cycle
– Determined by technology, organization, clever
circuit design
Other Metrics
 MIPS and MFLOPS
 MIPS = instruction count / (execution time x 10^6)
= clock rate / (CPI x 10^6)
 But MIPS has serious shortcomings
Problems with MIPS
 E.g. without floating-point hardware, an FP op
may take 50 single-cycle instructions at a clock
rate of say 1MHz
 With FP hardware, only one 2-cycle instruction
at the same clock rate
 Thus, adding FP hardware:
– Instructions/program decreases: 50 => 1
– CPI increases: 50/50 = 1 => 2/1 = 2
– Total execution time decreases: 50 x 1 / 1 MHz = 50 µs => 1 x 2 / 1 MHz = 2 µs
 BUT, MIPS gets worse: 1 MIPS => 0.5 MIPS
– Without FP hardware: 50 / (50 x 10^-6 x 10^6) = 1 MIPS
– With FP hardware: 1 / (2 x 10^-6 x 10^6) = 0.5 MIPS
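The floating-point example can be reproduced with a small Python sketch. The function names are assumptions for illustration; the instruction counts, CPIs and the 1 MHz clock are the figures from the slide.

```python
def execution_time_s(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """Execution time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def mips(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """MIPS = instruction count / (execution time x 10^6)."""
    time_s = execution_time_s(instruction_count, cpi, clock_rate_hz)
    return instruction_count / (time_s * 1e6)


if __name__ == "__main__":
    CLOCK_HZ = 1e6  # 1 MHz, as in the slide
    for label, count, cpi in [("software FP", 50, 1), ("hardware FP", 1, 2)]:
        t_us = execution_time_s(count, cpi, CLOCK_HZ) * 1e6
        print(f"{label}: {t_us:.0f} us, {mips(count, cpi, CLOCK_HZ):.2f} MIPS")
```

The hardware version finishes 25 times sooner yet reports half the MIPS rating, which is why MIPS alone is a misleading metric.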
Problems with MIPS
 Ignores program
 Usually used to quote peak performance
– Ideal conditions => guaranteed not to exceed!
 When is MIPS ok?
– Same compiler, same ISA
– E.g. same binary running on AMD Phenom,
Intel Core i7
– Why? Instr/program is constant and can be
ignored
Other Metrics
 MFLOPS = FP ops in program / (execution time x 10^6)
 Assuming FP ops independent of compiler and
ISA
– Often safe for numeric codes: matrix size determines
# of FP ops/program
– However, not always safe:
 Missing instructions (e.g. FP divide)
 Optimizing compilers
 Relative MIPS and normalized MFLOPS
– Adds to confusion
 Use ONLY Time
Our Goal
 Minimize time which is the product, NOT
isolated factors
 Common error is to miss factors while
devising optimizations
– E.g. ISA change to decrease instruction count
– BUT leads to CPU organization which makes
clock slower
 Bottom line: factors are inter-related
Example 1
 Machine A: clock 1ns, CPI 2.0, for program x
 Machine B: clock 2ns, CPI 1.2, for program x
 Which is faster and how much?
Time/Program = instr/program x cycles/instr x sec/cycle
Time(A) = N x 2.0 x 1 = 2N
Time(B) = N x 1.2 x 2 = 2.4N
Compare: Time(B)/Time(A) = 2.4N/2N = 1.2
 So, Machine A is 20% faster than Machine B for
this program
Example 2
Keep clock(A) @ 1ns and clock(B) @2ns
For equal performance, if CPI(B)=1.2, what
is CPI(A)?
Time(B)/Time(A) = 1 = (Nx2x1.2)/(Nx1xCPI(A))
CPI(A) = 2.4
Example 3
 Keep CPI(A)=2.0 and CPI(B)=1.2
 For equal performance, if clock(B)=2ns,
what is clock(A)?
Time(A)/Time(B) = 1 = (N x 2.0 x clock(A))/(N x 1.2 x 2)
clock(A) = 1.2ns
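Examples 1 to 3 all rearrange Time/Program = instr/program x cycles/instr x sec/cycle. The Python sketch below recomputes them; the instruction count `N` is an arbitrary assumed value, since it cancels in every ratio, and the variable names are illustrative.

```python
N = 1_000_000  # arbitrary instruction count (assumption); it cancels in every ratio


def time_ns(instr: int, cpi: float, clock_ns: float) -> float:
    """Time/Program = instr/program x cycles/instr x ns/cycle."""
    return instr * cpi * clock_ns


# Example 1: Time(B) / Time(A) with A = (1 ns, CPI 2.0) and B = (2 ns, CPI 1.2)
ratio = time_ns(N, 1.2, 2.0) / time_ns(N, 2.0, 1.0)

# Example 2: CPI(A) that equalizes the machines when clock(A) = 1 ns, clock(B) = 2 ns
cpi_a = 1.2 * 2.0 / 1.0

# Example 3: clock(A) that equalizes the machines when CPI(A) = 2.0, CPI(B) = 1.2
clock_a_ns = 1.2 * 2.0 / 2.0

if __name__ == "__main__":
    print(f"Time(B)/Time(A) = {ratio:.2f}")
    print(f"CPI(A) for equal performance = {cpi_a:.2f}")
    print(f"clock(A) for equal performance = {clock_a_ns:.2f} ns")
```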
Which Programs
 Execution time of what program?
 Best case – you always run the same set of
programs
– Port them and time the whole workload
 In reality, use benchmarks
– Programs chosen to measure performance
– Predict performance of actual workload
– Saves effort and money
 Performance best determined by running a real application
– use programs typical of expected workload
– or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
 Small benchmarks
– nice for architects and designers
– easy to standardize
– can be abused!
 Benchmark suites
– Perfect Club: set of application codes
– Livermore Loops: 24 loop kernels
– Linpack: linear algebra package
– SPEC: mix of code from industry organization
Benchmarks
SPEC (Standard Performance
Evaluation Corporation)
 Sponsored by industry but independent and self-managed – trusted
by code developers and machine vendors
 Clear guides for testing, see www.spec.org
 Regular updates (benchmarks are dropped and new ones added
periodically according to relevance)
 Specialized benchmarks for particular classes of applications
 Can still be abused…, by selective optimization!
SPEC History
 First Round: SPEC CPU89
– 10 programs yielding a single number
 Second Round: SPEC CPU92
– SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating point
programs)
– compiler flags can be set differently for different programs
 Third Round: SPEC CPU95
– new set of programs: SPEC CINT95 (8 integer programs) and SPEC
CFP95 (10 floating point)
– single flag setting for all programs
 Fourth Round: SPEC CPU2000
– new set of programs: SPEC CINT2000 (12 integer programs) and
SPEC CFP2000 (14 floating point)
– single flag setting for all programs
– programs in C, C++, Fortran 77, and Fortran 90
Benchmarks: SPEC2000
 12 integer and 14 floating-point programs
– Sun Ultra-5 300MHz reference machine has
score of 100
Benchmarks: SPEC CINT2000
Benchmark Description
164.gzip Compression
175.vpr FPGA place and route
176.gcc C compiler
181.mcf Combinatorial optimization
186.crafty Chess
197.parser Word processing, grammatical analysis
252.eon Visualization (ray tracing)
253.perlbmk PERL script execution
254.gap Group theory interpreter
255.vortex Object-oriented database
256.bzip2 Compression
300.twolf Place and route simulator
Benchmarks: SPEC CFP2000
Benchmark Description
168.wupwise Physics/Quantum Chromodynamics
171.swim Shallow water modeling
172.mgrid Multi-grid solver: 3D potential field
173.applu Parabolic/elliptic PDE
177.mesa 3-D graphics library
178.galgel Computational Fluid Dynamics
179.art Image Recognition/Neural Networks
183.equake Seismic Wave Propagation Simulation
187.facerec Image processing: face recognition
188.ammp Computational chemistry
189.lucas Number theory/primality testing
191.fma3d Finite-element Crash Simulation
200.sixtrack High energy nuclear physics accelerator design
301.apsi Meteorology: Pollutant distribution
SPEC CPU2000 reporting
 Refer to the SPEC website, www.spec.org, for documentation
 Single number result – geometric mean of normalized ratios for each
code in the suite
 Report precise description of machine
 Report compiler flag setting
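The single-number result can be sketched in Python as follows. The normalization used here (reference time divided by measured time, scaled so that the reference machine scores 100) is an assumption based on the reference-machine score mentioned earlier; SPEC's own documentation defines the exact run and reporting rules. The times in the usage example are made up.

```python
import math


def normalized_ratios(reference_times, measured_times):
    """Per-benchmark ratio, scaled so the reference machine scores 100."""
    return [100 * ref / t for ref, t in zip(reference_times, measured_times)]


def geometric_mean(values):
    """n-th root of the product of n values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    reference = [1400.0, 1900.0, 1100.0]  # made-up reference times (seconds)
    measured = [700.0, 475.0, 1100.0]     # made-up times on the machine under test
    print(geometric_mean(normalized_ratios(reference, measured)))
```

A geometric mean is used because the ranking of two machines then does not depend on which machine is chosen as the reference.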
Benchmark Pitfalls
 Benchmark not representative
– If your workload is I/O bound, SPEC CPU scores are useless
 Benchmark is too old
– Benchmarks age poorly; benchmarketing
pressure causes vendors to optimize
compiler/hardware/software to benchmarks
– Need to be periodically refreshed
Specialized SPEC Benchmarks
 I/O
 Network
 Graphics
 Java
 Web server
 Transaction processing (databases)
Amdahl’s Law: Quantitative
Principles of Computer Design
 Amdahl’s Law:
The performance gain from improving some portion of a computer (i.e., using
some enhancement) is calculated by:
$$\text{Speedup} = \frac{\text{Performance for entire task using the enhancement}}{\text{Performance for entire task without using the enhancement}}$$
or
$$\text{Speedup} = \frac{\text{Execution time for entire task without the enhancement}}{\text{Execution time for entire task using the enhancement}}$$
Here: Task = Program. Recall: Performance = 1 / Execution Time
Performance Enhancement
Calculations: Amdahl's Law
 The performance enhancement possible due to a given design
improvement is limited by the amount that the improved feature is used
 Amdahl’s Law:
Performance improvement or speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\text{Execution Time with } E} = \frac{\text{Performance with } E}{\text{Performance without } E}$$
– Suppose that enhancement E accelerates a fraction F of the original execution
time by a factor S and the remainder of the time is unaffected; then:
$$\text{Execution Time with } E = \left((1 - F) + \frac{F}{S}\right) \times \text{Execution Time without } E$$
Hence speedup is given by:
$$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\left((1 - F) + \frac{F}{S}\right) \times \text{Execution Time without } E} = \frac{1}{(1 - F) + \frac{F}{S}}$$
Note: F (the fraction of execution time enhanced) refers to the original
execution time, before the enhancement is applied.
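Amdahl's Law translates directly into a one-line function. The Python sketch below uses an assumed function name, `amdahl_speedup`, and adds input checks that are not part of the slides.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Speedup = 1 / ((1 - F) + F/S).

    f: fraction of the ORIGINAL execution time that is enhanced (0 <= f <= 1)
    s: factor by which that fraction is accelerated (s > 0)
    """
    if not 0 <= f <= 1:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1 / ((1 - f) + f / s)


if __name__ == "__main__":
    print(f"{amdahl_speedup(0.45, 2.5):.2f}")  # 1.37
    # Even an enormous S cannot beat 1 / (1 - F):
    print(f"{amdahl_speedup(0.8, 1e12):.2f}")  # 5.00
```

The second call illustrates the limit stated above: the unaffected fraction (1 - F) bounds the achievable speedup.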
Pictorial Depiction of
Amdahl’s Law
[Figure: execution time drawn as a bar, shown normalized to 1 = (1 - F) + F.
Before (execution time without enhancement E, before the enhancement is applied): unaffected fraction (1 - F), followed by affected fraction F.
After (execution time with enhancement E): unaffected fraction (1 - F), unchanged, followed by F/S.]
Enhancement E accelerates fraction F of original execution time by a factor of S
$$\text{Speedup}(E) = \frac{\text{Execution Time without enhancement } E}{\text{Execution Time with enhancement } E} = \frac{1}{(1 - F) + \frac{F}{S}}$$
What if the fraction given is
after the enhancement has been applied?
How would you solve the problem?
(i.e find expression for speedup)
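One possible approach is sketched below. The symbols $F'$ (the fraction of the execution time *with* the enhancement that is spent in the enhanced portion) and $T$ are assumptions introduced here, not notation from the slides. The idea is to rebuild the original time by stretching the enhanced portion back by the factor $S$.

```latex
\begin{align*}
T_{\text{with } E} &= (1 - F')\,T_{\text{with } E} + F'\,T_{\text{with } E} \\
T_{\text{without } E} &= (1 - F')\,T_{\text{with } E} + S \cdot F'\,T_{\text{with } E} \\
\text{Speedup}(E) &= \frac{T_{\text{without } E}}{T_{\text{with } E}} = (1 - F') + S \cdot F'
\end{align*}
```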
Example 1
 For a RISC machine with the following instruction mix:
Op      Freq   Cycles   CPI(i)   % Time
ALU     50%    1        .5       23%
Load    20%    5        1.0      45%
Store   10%    3        .3       14%
Branch  20%    2        .4       18%
Total CPI = 2.2
 If a CPU design enhancement improves the CPI of load instructions from
5 to 2, what is the resulting performance improvement from this
enhancement?
Fraction enhanced = F = 45% or .45
Unaffected fraction = 1 - F = 100% - 45% = 55% or .55
Factor of enhancement = S = 5/2 = 2.5
Using Amdahl’s Law:
$$\text{Speedup}(E) = \frac{1}{(1 - F) + \frac{F}{S}} = \frac{1}{.55 + \frac{.45}{2.5}} = 1.37$$
Example 1 Alternative solution
Op      Freq   Cycles   CPI(i)   % Time
ALU     50%    1        .5       23%
Load    20%    5        1.0      45%
Store   10%    3        .3       14%
Branch  20%    2        .4       18%
 If a CPU design enhancement improves the CPI of load instructions from
5 to 2, what is the resulting performance improvement from this
enhancement?
Old CPI = 2.2
New CPI = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6 (the CPI of load is now 2 instead of 5)
Using T = I x CPI x C:
$$\text{Speedup}(E) = \frac{\text{Original Execution Time}}{\text{New Execution Time}} = \frac{\text{Instruction count} \times \text{old CPI} \times \text{clock cycle}}{\text{Instruction count} \times \text{new CPI} \times \text{clock cycle}} = \frac{\text{old CPI}}{\text{new CPI}} = \frac{2.2}{1.6} = 1.375$$
This agrees with the speedup of 1.37 obtained from Amdahl’s Law in the first solution; the small difference comes from rounding F to 45% there.
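Both routes to the answer can be checked with a short Python sketch. The names `MIX`, `average_cpi` and `amdahl_speedup` are assumptions for illustration; the frequencies and cycle counts come from the table above. The Amdahl route here uses the unrounded load fraction (1.0/2.2) rather than 45%.

```python
# Instruction mix from the example: op -> (frequency, cycles per instruction)
MIX = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}


def average_cpi(mix: dict) -> float:
    """Average CPI = sum over ops of frequency x cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


def amdahl_speedup(f: float, s: float) -> float:
    """Speedup = 1 / ((1 - F) + F/S), with F a fraction of the original time."""
    return 1 / ((1 - f) + f / s)


old_cpi = average_cpi(MIX)
new_cpi = average_cpi({**MIX, "Load": (0.2, 2)})  # load CPI improved from 5 to 2

# Route 1: ratio of CPIs (instruction count and clock cycle cancel).
speedup_from_cpi = old_cpi / new_cpi

# Route 2: Amdahl's Law with the unrounded fraction of time spent in loads.
load_fraction = 0.2 * 5 / old_cpi
speedup_from_amdahl = amdahl_speedup(load_fraction, 5 / 2)

if __name__ == "__main__":
    print(f"old CPI = {old_cpi:.2f}, new CPI = {new_cpi:.2f}")
    print(f"speedup via CPI ratio = {speedup_from_cpi:.3f}")
    print(f"speedup via Amdahl    = {speedup_from_amdahl:.3f}")
```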
Example 2
 A program runs in 100 seconds on a machine with multiply operations
responsible for 80 seconds of this time. By how much must the speed
of multiplication be improved to make the program four times faster?
$$\text{Desired speedup} = 4 = \frac{100}{\text{Execution Time with enhancement}}$$
Execution time with enhancement = 100/4 = 25 seconds
25 seconds = (100 - 80 seconds) + 80 seconds / S
25 seconds = 20 seconds + 80 seconds / S
=> 5 seconds = 80 seconds / S
=> S = 80/5 = 16
Hence multiplication should be 16 times faster to get an overall speedup of 4.
Alternatively, it can also be solved by finding the enhanced fraction of execution time:
F = 80/100 = .8
and then solving Amdahl’s speedup equation for the desired enhancement factor S:
$$\text{Speedup}(E) = \frac{1}{(1 - F) + \frac{F}{S}} = 4 = \frac{1}{(1 - .8) + \frac{.8}{S}} = \frac{1}{.2 + \frac{.8}{S}}$$
Solving for S gives S = 16
(Here, machine = CPU.)
Example 3
 For the previous example with a program running in 100 seconds on a
machine with multiply operations responsible for 80 seconds of this
time. By how much must the speed of multiplication be improved to
make the program five times faster?
$$\text{Desired speedup} = 5 = \frac{100}{\text{Execution Time with enhancement}}$$
Execution time with enhancement = 100/5 = 20 seconds
20 seconds = (100 - 80 seconds) + 80 seconds / S
20 seconds = 20 seconds + 80 seconds / S
=> 0 = 80 seconds / S
No finite S satisfies this, so no amount of multiplication speed improvement can achieve a speedup of 5.
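Examples 2 and 3 solve the same equation for S, so they can share one helper. This is a minimal Python sketch; the function name `required_factor` and its convention of returning `None` for an unreachable target are assumptions made for illustration.

```python
from typing import Optional


def required_factor(total_s: float, enhanced_s: float, target_speedup: float) -> Optional[float]:
    """Factor S by which the enhanced part must speed up, or None if impossible."""
    unaffected_s = total_s - enhanced_s
    # Time left for the enhanced part once the unaffected part is paid for.
    budget_s = total_s / target_speedup - unaffected_s
    if budget_s <= 0:
        return None  # the unaffected time alone already uses up the target time
    return enhanced_s / budget_s


if __name__ == "__main__":
    print(required_factor(100, 80, 4))  # 16.0
    print(required_factor(100, 80, 5))  # None
```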
Multiple Enhancements
 Suppose that enhancement Ei accelerates a fraction Fi of the
original execution time by a factor Si and the remainder of the
time is unaffected (i = 1, 2, ..., n: n enhancements, each affecting a
different portion of execution time); then:
$$\text{Speedup} = \frac{\text{Original Execution Time}}{\left(\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}\right) \times \text{Original Execution Time}}$$
$$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$
where $1 - \sum_i F_i$ is the unaffected fraction.
Note: All fractions Fi refer to original execution time before the
enhancements are applied.
What if the fractions given are after the enhancements were applied?
How would you solve the problem? (i.e., find an expression for speedup)
Multiple Enhancements Example
 Three CPU performance enhancements are proposed with the following
speedups and percentage of the code execution time affected:
Speedup1 = S1 = 10 Percentage1 = F1 = 20%
Speedup2 = S2 = 15 Percentage2 = F2 = 15%
Speedup3 = S3 = 30 Percentage3 = F3 = 10%
 While all three enhancements are in place in the new design, each
enhancement affects a different portion of the code and only one
enhancement can be used at a time.
 What is the resulting overall speedup?
$$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$
 Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30]
= 1 / [ .55 + .0333 ]
= 1 / .5833 = 1.71
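The multiple-enhancement formula can be sketched in Python as below. The function name `combined_speedup` and the input check are assumptions for illustration; the (F, S) pairs are the three enhancements from the example.

```python
def combined_speedup(enhancements):
    """Speedup = 1 / ((1 - sum F_i) + sum F_i / S_i).

    `enhancements` is a list of (F_i, S_i) pairs; every F_i is a fraction of
    the ORIGINAL execution time and the affected portions do not overlap.
    """
    total_fraction = sum(f for f, _ in enhancements)
    if total_fraction > 1 + 1e-12:
        raise ValueError("fractions of the original time cannot exceed 1")
    enhanced_time = sum(f / s for f, s in enhancements)
    return 1 / ((1 - total_fraction) + enhanced_time)


if __name__ == "__main__":
    proposals = [(0.20, 10), (0.15, 15), (0.10, 30)]
    print(f"overall speedup = {combined_speedup(proposals):.2f}")  # 1.71
```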
Pictorial Depiction of Example
[Figure: execution time drawn as a bar, normalized to 1.
Before (execution time with no enhancements: 1): unaffected fraction .55, then F1 = .2, F2 = .15, F3 = .1.
After: unaffected fraction .55, unchanged, then F1 / 10, F2 / 15, F3 / 30, with S1 = 10, S2 = 15, S3 = 30.]
Execution Time with enhancements: .55 + .02 + .01 + .00333 = .5833
Speedup = 1 / .5833 = 1.71
Note: All fractions Fi refer to original execution time.
What if the fractions given are after the enhancements were applied?
How would you solve the problem?
Summary
 Time and performance: Machine A n times
faster than Machine B
– Iff Time(B)/Time(A) = n
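 Performance = Time/program:
$$\frac{\text{Time}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{code size}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{cycle time}}$$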
Summary Cont’d
 Other Metrics: MIPS and MFLOPS
– Beware of peak and omitted details
 Benchmarks: SPEC2000
 Amdahl’s Law:
$$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{s}}$$
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 

L07_performance and cost in advanced hardware- computer architecture.pptx

  • 1. COE 585: Advanced Computer Architecture Lecture 6: Performance and Cost Instructor: K. O. Boateng Semester 1, 2023/24 Dept. of Comp. Engg., KNUST, Kumasi
  • 2. Performance and Cost
      ▪ Which of the following airplanes has the best performance?
      ▪ How much faster is the Concorde vs. the 747?
      ▪ How much bigger is the 747 vs. the DC-8?
  • 3. Performance and Cost
      ▪ Which computer is fastest?
      ▪ Not so simple; it depends on the workload:
          – Scientific simulation: FP performance
          – Program development: integer performance
          – Database workload: memory, I/O
  • 4. Performance of Computers
      ▪ Want to buy the fastest computer for what you want to do?
          – Workload is all-important
          – Correct measurement and analysis
      ▪ Want to design the fastest computer for what the customer wants to pay?
          – Cost is an important criterion
  • 5. Performance
      ▪ Performance is the key to understanding the underlying motivation for the hardware and its organization
      ▪ Measure, report, and summarize performance to enable users to
          – make intelligent choices
          – see through the marketing hype!
      ▪ Why is some hardware better than others for different programs?
      ▪ What factors of system performance are hardware related? (e.g., do we need a new machine, or a new operating system?)
      ▪ How does the machine's instruction set affect performance?
  • 6. Outline of the rest
      ▪ Time and performance
      ▪ Formulation for Performance Evaluation
      ▪ MIPS and MFLOPS
      ▪ Amdahl’s law
  • 7. Computer Performance: TIME, TIME, TIME!!!
      ▪ Response time (elapsed time, latency): individual user concerns
          – how long does it take for my job to run?
          – how long does it take to execute (start to finish) my job?
          – how long must I wait for the database query?
      ▪ Throughput: systems manager concerns
          – how many jobs can the machine run at once?
          – what is the average execution rate?
          – how much work is getting done?
      ▪ If we upgrade a machine with a new processor, what do we increase?
      ▪ If we add a new machine to the lab, what do we increase?
  • 8. Defining Performance
      ▪ What is important to whom?
      ▪ Computer system user
          – Minimize elapsed time for program = time_end - time_start
          – Called response time
      ▪ Computer center manager
          – Maximize completion rate = #jobs/second
          – Called throughput
  • 9. Response Time vs. Throughput
      ▪ Is throughput = 1/(average response time)?
          – Only if there is NO overlap
          – Otherwise, throughput > 1/(average response time)
      ▪ E.g. a lunch buffet with 5 entrées
          – Each person takes 2 minutes per entrée
          – Throughput is 1 person every 2 minutes
          – BUT the time to fill up a tray (the response time) is 10 minutes
          – Why, and what would the throughput be otherwise?
              ▪ With overlap, 5 people fill their trays simultaneously, one at each entrée
              ▪ Without overlap, throughput = 1/10 person per minute (1 person every 10 minutes)
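The buffet numbers above can be checked with a few lines of Python. This is only a sketch: the function name `buffet_throughput` and its parameters are made up for illustration, and the overlapped case assumes the line is already full (steady state).

```python
def buffet_throughput(entrees: int, minutes_per_entree: float, overlap: bool) -> float:
    """Return throughput in people per minute for the buffet line."""
    # Response time: minutes for one person to fill a tray.
    response_time = entrees * minutes_per_entree
    if overlap:
        # One person stands at each entree, so in steady state someone
        # finishes every `minutes_per_entree` minutes.
        return 1.0 / minutes_per_entree
    # Only one person is in the line at a time.
    return 1.0 / response_time


print(buffet_throughput(5, 2.0, overlap=True))   # 0.5 people per minute
print(buffet_throughput(5, 2.0, overlap=False))  # 0.1 people per minute
```

The response time is 10 minutes in both cases; only the throughput changes with overlap.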
  • 10. What is Performance for us?
      ▪ For computer architects
          – CPU time = time spent running a program
      ▪ Intuitively, a bigger performance number should mean a faster machine, so:
          – Performance = 1/(X time), where X is response, CPU execution, etc.
      ▪ Elapsed time = CPU time + I/O wait
      ▪ We will concentrate on CPU time
  • 11. Execution Time
      ▪ Elapsed time
          – counts everything (disk and memory accesses, waiting for I/O, running other programs, etc.) from start to finish
          – a useful number, but often not good for comparison purposes
          – elapsed time = CPU time + wait time (I/O, other programs, etc.)
      ▪ CPU time
          – doesn't count waiting for I/O or time spent running other programs
          – can be divided into user CPU time and system CPU time (OS calls)
          – CPU time = user CPU time + system CPU time
          – hence elapsed time = user CPU time + system CPU time + wait time
      ▪ Our focus: user CPU time (CPU execution time or, simply, execution time)
          – time spent executing the lines of code that are in our program
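The split between elapsed time and CPU time can be observed directly. The sketch below uses Python's `time.perf_counter` (wall clock) and `time.process_time` (user + system CPU time of the current process); the `job` function is a made-up workload in which `time.sleep` stands in for an I/O wait.

```python
import time


def measure(fn):
    """Return (elapsed_seconds, cpu_seconds) for one call of fn."""
    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # user + system CPU time of this process
    fn()
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu


def job():
    sum(i * i for i in range(200_000))  # CPU work
    time.sleep(0.2)                     # stands in for waiting on I/O


elapsed, cpu = measure(job)
print(f"elapsed = {elapsed:.3f} s, CPU = {cpu:.3f} s, wait ~ {elapsed - cpu:.3f} s")
```

The sleep shows up in the elapsed time but not in the CPU time, which is why CPU time is the better basis for comparing processors.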
  • 12. Definition of Performance
      ▪ For some program running on machine X: Performance_X = 1 / Execution time_X
      ▪ "X is n times faster than Y" means: Performance_X / Performance_Y = n
  • 13. Improve Performance
      ▪ Improve (a) response time or (b) throughput?
          – Faster CPU: helps both (a) and (b)
          – Add more CPUs: helps (b) and perhaps (a) due to less queueing
  • 14. Performance Comparison
      ▪ Machine A is n times faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = n
      ▪ Machine A is x% faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
      ▪ E.g. time(A) = 10 s, time(B) = 15 s
          – 15/10 = 1.5 => A is 1.5 times faster than B
          – 15/10 = 1.5 => A is 50% faster than B
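As a quick check of the two definitions, here is a small sketch; the helper names `times_faster` and `percent_faster` are illustrative only.

```python
def times_faster(time_a: float, time_b: float) -> float:
    """n such that A is n times faster than B: time(B) / time(A)."""
    return time_b / time_a


def percent_faster(time_a: float, time_b: float) -> float:
    """x such that A is x% faster than B: time(B) / time(A) = 1 + x/100."""
    return (time_b / time_a - 1.0) * 100.0


print(times_faster(10, 15))    # 1.5
print(percent_faster(10, 15))  # 50.0
```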
  • 15. Breaking Down Performance
      ▪ A program is broken into instructions
          – hardware is aware of instructions, not programs
      ▪ At a lower level, hardware breaks instructions into cycles
          – Lower-level state machines change state every cycle
      ▪ For example:
          – 1 GHz Snapdragon runs 1000M cycles/sec, 1 cycle = 1 ns
          – 2.5 GHz Core i7 runs 2.5G cycles/sec, 1 cycle = 0.40 ns
  • 16. Clock Cycles
      ▪ Instead of reporting execution time in seconds, we often use cycles. In modern computers hardware events progress cycle by cycle: each event (e.g., multiplication, addition) takes a sequence of cycles
          $$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$
      ▪ Clock ticks indicate the start and end of cycles (figure: a time axis marked with ticks, one cycle between consecutive ticks)
      ▪ cycle time = time between ticks = seconds per cycle
      ▪ clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec, 1 MHz = 10^6 cycles/sec)
      ▪ Example: a 200 MHz clock has a cycle time of
          $$\frac{1}{200 \times 10^{6}} \times 10^{9} = 5 \text{ nanoseconds}$$
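The clock rate to cycle time conversion is a one-liner; the sketch below (function name `cycle_time_ns` is illustrative) reproduces the 200 MHz example and the two processors quoted on the previous slide.

```python
def cycle_time_ns(clock_rate_hz: float) -> float:
    """Cycle time in nanoseconds: (1 / clock rate) seconds, scaled by 10^9."""
    return 1e9 / clock_rate_hz


print(cycle_time_ns(200e6))  # 5.0  (200 MHz)
print(cycle_time_ns(1e9))    # 1.0  (1 GHz)
print(cycle_time_ns(2.5e9))  # 0.4  (2.5 GHz)
```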
  • 17. Performance Equation I
          CPU execution time for a program = CPU clock cycles for a program × Clock cycle time
        equivalently
          $$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$
      ▪ So, to improve performance one can either:
          – reduce the number of cycles for a program, or
          – reduce the clock cycle time, or, equivalently,
          – increase the clock rate
  • 18. How many cycles are required for a program?
      ▪ Could assume that # of cycles = # of instructions (figure: a time axis divided into equal slots for the 1st, 2nd, 3rd, 4th, 5th, 6th, ... instruction)
      ▪ This assumption is incorrect, because different instructions take different amounts of time (cycles)
      ▪ Why…?
  • 19. How many cycles are required for a program?
      ▪ Multiplication takes more time than addition
      ▪ Floating point operations take longer than integer ones
      ▪ Accessing memory takes more time than accessing registers
      ▪ Important point: changing the cycle time often changes the number of cycles required for various instructions, because it means changing the hardware design
  • 20. Terminology
      ▪ A given program will require:
          – some number of instructions (machine instructions)
          – some number of cycles
          – some number of seconds
      ▪ We have a vocabulary that relates these quantities:
          – cycle time (seconds per cycle)
          – clock rate (cycles per second)
          – (average) CPI (cycles per instruction): a floating point intensive application might have a higher average CPI
          – MIPS (millions of instructions per second): this would be higher for a program using simple instructions
  • 21. Performance Equation II
          CPU execution time for a program = Instruction count for a program × average CPI × Clock cycle time
      ▪ Derive the above equation from Performance Equation I
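One way to see how Equation II follows from Equation I is to compute the cycle count as instruction count × average CPI and then apply Equation I. The sketch below does exactly that; the function name and the sample numbers (10^9 instructions, CPI 2.0, 1 GHz clock) are hypothetical and not taken from the slides.

```python
def cpu_time_seconds(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU execution time = instruction count x average CPI x clock cycle time."""
    # Step 1: cycles per program = instructions per program x cycles per instruction.
    clock_cycles = instruction_count * cpi
    # Step 2 (Equation I): time = cycles x cycle time = cycles / clock rate.
    return clock_cycles / clock_rate_hz


# Hypothetical program: 10^9 instructions, average CPI 2.0, 1 GHz clock.
print(cpu_time_seconds(1e9, 2.0, 1e9))  # 2.0 seconds
```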
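  • 22. Performance Factors
          $$\text{Processor Performance} = \frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
      Factor                 Also called   Level            Designer
      Instructions/Program   code size     Architecture     Compiler Designer
      Cycles/Instruction     CPI           Implementation   Processor Designer
      Time/Cycle             cycle time    Realization      Chip Designer
      ▪ Architecture --> Implementation --> Realization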
  • 23. Performance Factors
      ▪ Instructions/Program
          – Instructions executed, not static code size
          – Determined by algorithm, compiler, ISA
      ▪ Cycles/Instruction
          – Determined by ISA and CPU organization
          – Overlap among instructions reduces this factor
      ▪ Time/Cycle
          – Determined by technology, organization, clever circuit design
  • 24. Other Metrics
      ▪ MIPS and MFLOPS
      ▪ MIPS = instruction count/(execution time × 10^6) = clock rate/(CPI × 10^6)
      ▪ But MIPS has serious shortcomings
  • 25. Problems with MIPS
      ▪ E.g. without floating-point hardware, an FP op may take 50 single-cycle instructions at a clock rate of, say, 1 MHz
      ▪ With FP hardware, only one 2-cycle instruction at the same clock rate
      ▪ Thus, adding FP hardware:
          – Instructions/program decreases: 50 => 1
          – CPI increases: 50/50 = 1 => 2/1 = 2
          – Total execution time decreases: 50 × 1/1M => 1 × 2/1M
      ▪ BUT, MIPS gets worse: 1 MIPS => 0.5 MIPS
          – Without FP hardware: 50/(50 × 10^-6 × 10^6) = 1 MIPS
          – With FP hardware: 1/(2 × 10^-6 × 10^6) = 0.5 MIPS
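The FP-hardware example can be reproduced numerically. In this sketch the helper `fp_op_metrics` is an illustrative name; it works in microseconds and MHz so that "instructions per microsecond" is the same number as MIPS.

```python
def fp_op_metrics(instructions: int, cycles_per_instruction: int, clock_mhz: float):
    """Return (execution_time_us, mips) for one floating-point operation."""
    cycles = instructions * cycles_per_instruction
    time_us = cycles / clock_mhz   # a 1 MHz clock delivers 1 cycle per microsecond
    mips = instructions / time_us  # instructions per microsecond = millions per second
    return time_us, mips


software = fp_op_metrics(50, 1, 1.0)  # no FP hardware: 50 single-cycle instructions
hardware = fp_op_metrics(1, 2, 1.0)   # FP hardware: one 2-cycle instruction

print(software)  # (50.0, 1.0)
print(hardware)  # (2.0, 0.5)
```

Execution time drops from 50 µs to 2 µs while the MIPS rating halves, which is the shortcoming the slide points out.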
  • 26. Problems with MIPS
      ▪ Ignores program
      ▪ Usually used to quote peak performance
          – Ideal conditions => guaranteed not to exceed!
      ▪ When is MIPS ok?
          – Same compiler, same ISA
          – E.g. same binary running on AMD Phenom, Intel Core i7
          – Why? Instr/program is constant and can be ignored
  • 27. Other Metrics
      ▪ MFLOPS = FP ops in program/(execution time × 10^6)
      ▪ Assumes FP ops are independent of compiler and ISA
          – Often safe for numeric codes: matrix size determines # of FP ops/program
          – However, not always safe:
              ▪ Missing instructions (e.g. FP divide)
              ▪ Optimizing compilers
      ▪ Relative MIPS and normalized MFLOPS
          – Adds to confusion
      ▪ Use ONLY Time
  • 28. Our Goal
      ▪ Minimize time, which is the product of the factors, NOT the isolated factors
      ▪ A common error is to miss factors while devising optimizations
          – E.g. an ISA change to decrease instruction count
          – BUT it leads to a CPU organization which makes the clock slower
      ▪ Bottom line: the factors are inter-related
  • 29. Example 1
      ▪ Machine A: clock 1 ns, CPI 2.0, for program x
      ▪ Machine B: clock 2 ns, CPI 1.2, for program x
      ▪ Which is faster, and by how much?
          Time/Program = instr/program × cycles/instr × sec/cycle
          Time(A) = N × 2.0 × 1 ns = 2N ns
          Time(B) = N × 1.2 × 2 ns = 2.4N ns
          Compare: Time(B)/Time(A) = 2.4N/2N = 1.2
      ▪ So, Machine A is 20% faster than Machine B for this program
  • 30. Example 2
      ▪ Keep clock(A) at 1 ns and clock(B) at 2 ns
      ▪ For equal performance, if CPI(B) = 1.2, what is CPI(A)?
          Time(B)/Time(A) = 1 = (N × 1.2 × 2)/(N × CPI(A) × 1)
          CPI(A) = 2.4
  • 31. Example 3
      ▪ Keep CPI(A) = 2.0 and CPI(B) = 1.2
      ▪ For equal performance, if clock(B) = 2 ns, what is clock(A)?
          Time(A)/Time(B) = 1 = (N × 2.0 × clock(A))/(N × 1.2 × 2)
          clock(A) = 1.2 ns
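Examples 1 to 3 all apply the same product of three factors, so they can be checked together. In the sketch below the instruction count `N` is an arbitrary placeholder (it cancels in every ratio), and `cpu_time_ns` is an illustrative helper name.

```python
def cpu_time_ns(instructions: float, cpi: float, cycle_ns: float) -> float:
    """Time/program = instr/program x cycles/instr x ns/cycle."""
    return instructions * cpi * cycle_ns


N = 1_000_000  # arbitrary instruction count; it cancels out of every ratio

# Example 1: A has clock 1 ns, CPI 2.0; B has clock 2 ns, CPI 1.2.
ratio = cpu_time_ns(N, 1.2, 2.0) / cpu_time_ns(N, 2.0, 1.0)

# Example 2: equal performance with the clocks fixed, solve for CPI(A):
# N x CPI(A) x 1 = N x 1.2 x 2
cpi_a = 1.2 * 2.0 / 1.0

# Example 3: equal performance with the CPIs fixed, solve for clock(A):
# N x 2.0 x clock(A) = N x 1.2 x 2
clock_a_ns = 1.2 * 2.0 / 2.0

print(f"Time(B)/Time(A) = {ratio:.2f}")      # 1.20
print(f"CPI(A)          = {cpi_a:.2f}")      # 2.40
print(f"clock(A)        = {clock_a_ns:.2f} ns")  # 1.20 ns
```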
  • 32. Which Programs
      ▪ Execution time of what program?
      ▪ Best case: you always run the same set of programs
          – Port them and time the whole workload
      ▪ In reality, use benchmarks
          – Programs chosen to measure performance
          – Predict performance of actual workload
          – Saves effort and money
  • 33. Benchmarks
      ▪ Performance is best determined by running a real application
          – use programs typical of the expected workload
          – or typical of the expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
      ▪ Small benchmarks
          – nice for architects and designers
          – easy to standardize
          – can be abused!
      ▪ Benchmark suites
          – Perfect Club: set of application codes
          – Livermore Loops: 24 loop kernels
          – Linpack: linear algebra package
          – SPEC: mix of code from industry organization
  • 34. SPEC (Standard Performance Evaluation Corporation)
      ▪ Sponsored by industry but independent and self-managed
          – trusted by code developers and machine vendors
      ▪ Clear guides for testing, see www.spec.org
      ▪ Regular updates (benchmarks are dropped and new ones added periodically according to relevance)
      ▪ Specialized benchmarks for particular classes of applications
      ▪ Can still be abused by selective optimization!
  • 35. SPEC History
      ▪ First Round: SPEC CPU89
          – 10 programs yielding a single number
      ▪ Second Round: SPEC CPU92
          – SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating point programs)
          – compiler flags can be set differently for different programs
      ▪ Third Round: SPEC CPU95
          – new set of programs: SPEC CINT95 (8 integer programs) and SPEC CFP95 (10 floating point)
          – single flag setting for all programs
      ▪ Fourth Round: SPEC CPU2000
          – new set of programs: SPEC CINT2000 (12 integer programs) and SPEC CFP2000 (14 floating point)
          – single flag setting for all programs
          – programs in C, C++, Fortran 77, and Fortran 90
  • 36. Benchmarks: SPEC2000
      ▪ 12 integer and 14 floating-point programs
          – Sun Ultra-5 300 MHz reference machine has a score of 100
  • 37. Benchmarks: SPEC CINT2000
      Benchmark      Description
      164.gzip       Compression
      175.vpr        FPGA place and route
      176.gcc        C compiler
      181.mcf        Combinatorial optimization
      186.crafty     Chess
      197.parser     Word processing, grammatical analysis
      252.eon        Visualization (ray tracing)
      253.perlbmk    PERL script execution
      254.gap        Group theory interpreter
      255.vortex     Object-oriented database
      256.bzip2      Compression
      300.twolf      Place and route simulator
  • 38. Benchmarks: SPEC CFP2000
      Benchmark      Description
      168.wupwise    Physics/Quantum Chromodynamics
      171.swim       Shallow water modeling
      172.mgrid      Multi-grid solver: 3D potential field
      173.applu      Parabolic/elliptic PDE
      177.mesa       3-D graphics library
      178.galgel     Computational Fluid Dynamics
      179.art        Image Recognition/Neural Networks
      183.equake     Seismic Wave Propagation Simulation
      187.facerec    Image processing: face recognition
      188.ammp       Computational chemistry
      189.lucas      Number theory/primality testing
      191.fma3d      Finite-element Crash Simulation
      200.sixtrack   High energy nuclear physics accelerator design
      301.apsi       Meteorology: Pollutant distribution
  • 39. SPEC CPU2000 reporting
      ▪ Refer to the SPEC website, www.spec.org, for documentation
      ▪ Single-number result: geometric mean of normalized ratios for each code in the suite
      ▪ Report a precise description of the machine
      ▪ Report the compiler flag setting
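A minimal sketch of the single-number result, assuming each normalized ratio is 100 × (reference machine time / measured time), so that the reference machine itself scores 100. The timings below are invented purely for illustration and are not real SPEC results; `spec_ratio` and `geometric_mean` are hypothetical helper names.

```python
import math


def spec_ratio(reference_seconds: float, measured_seconds: float) -> float:
    """Normalized ratio (assumed form): the reference machine scores 100."""
    return 100.0 * reference_seconds / measured_seconds


def geometric_mean(values):
    """n-th root of the product, computed via logs to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Invented (reference_time, measured_time) pairs for three benchmark codes.
timings = [(1400.0, 700.0), (1800.0, 450.0), (1100.0, 1100.0)]
ratios = [spec_ratio(ref, measured) for ref, measured in timings]  # 200, 400, 100
score = geometric_mean(ratios)

print(f"ratios = {ratios}, score = {score:.1f}")
```

Unlike an arithmetic mean, the geometric mean of ratios gives the same relative ranking regardless of which machine is chosen as the reference.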
  • 40. Benchmark Pitfalls
      ▪ Benchmark not representative
          – If your workload is I/O bound, SPEC is useless
      ▪ Benchmark is too old
          – Benchmarks age poorly; benchmarketing pressure causes vendors to optimize compiler/hardware/software to benchmarks
          – Need to be periodically refreshed
  • 41. Specialized SPEC Benchmarks
      ▪ I/O
      ▪ Network
      ▪ Graphics
      ▪ Java
      ▪ Web server
      ▪ Transaction processing (databases)
  • 42. Amdahl’s Law: Quantitative Principles of Computer Design
      ▪ Amdahl’s Law: the performance gain from improving some portion of a computer (i.e., using some enhancement) is calculated by:
          $$\text{Speedup} = \frac{\text{Performance for entire task using the enhancement}}{\text{Performance for entire task without using the enhancement}}$$
        or
          $$\text{Speedup} = \frac{\text{Execution time for entire task without the enhancement}}{\text{Execution time for entire task using the enhancement}}$$
      ▪ Here: Task = Program
      ▪ Recall: Performance = 1 / Execution Time
  • 43. Performance Enhancement Calculations: Amdahl's Law
      ▪ The performance enhancement possible due to a given design improvement is limited by the amount that the improved feature is used
      ▪ Amdahl’s Law: performance improvement or speedup due to enhancement E:
          $$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\text{Execution Time with } E} = \frac{\text{Performance with } E}{\text{Performance without } E}$$
          – Suppose that enhancement E accelerates a fraction F of the original execution time by a factor S and the remainder of the time is unaffected; then:
          $$\text{Execution Time with } E = \left((1 - F) + \frac{F}{S}\right) \times \text{Execution Time without } E$$
          – Hence the speedup is given by:
          $$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\left((1 - F) + \frac{F}{S}\right) \times \text{Execution Time without } E} = \frac{1}{(1 - F) + \frac{F}{S}}$$
      ▪ F (fraction of execution time enhanced) refers to the original execution time, before the enhancement is applied
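The speedup formula above translates directly into a small function. This is a sketch; the name `amdahl_speedup` and the sample fraction F = 0.8 are chosen only to illustrate how the speedup saturates at 1/(1 - F) as S grows.

```python
def amdahl_speedup(fraction_enhanced: float, enhancement_factor: float) -> float:
    """Speedup(E) = 1 / ((1 - F) + F / S), with F measured on the original run."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("F must be a fraction of the original execution time")
    if enhancement_factor <= 0.0:
        raise ValueError("S must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / enhancement_factor)


# With F = 0.8 the speedup can never reach 1 / (1 - 0.8) = 5, however large S gets.
for s in (2, 10, 100, 1_000_000):
    print(s, round(amdahl_speedup(0.8, s), 3))
```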
  • 44. Pictorial Depiction of Amdahl’s Law
      ▪ Enhancement E accelerates fraction F of the original execution time by a factor of S
      ▪ Before: execution time without enhancement E (shown normalized to 1 = (1 - F) + F):
          [ Unaffected fraction: (1 - F) | Affected fraction: F ]
      ▪ After: execution time with enhancement E:
          [ Unaffected fraction: (1 - F), unchanged | F/S ]
          $$\text{Speedup}(E) = \frac{\text{Execution Time without enhancement } E}{\text{Execution Time with enhancement } E} = \frac{1}{(1 - F) + \frac{F}{S}}$$
      ▪ What if the fraction given is after the enhancement has been applied? How would you solve the problem? (i.e., find an expression for speedup)
  • 45. Example 1
      ▪ For a RISC machine with the following instruction mix (CPI = 2.2):
          Op       Freq   Cycles   CPI(i)   % Time
          ALU      50%    1        .5       23%
          Load     20%    5        1.0      45%
          Store    10%    3        .3       14%
          Branch   20%    2        .4       18%
      ▪ If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement?
          Fraction enhanced = F = 45% or .45
          Unaffected fraction = 1 - F = 100% - 45% = 55% or .55
          Factor of enhancement = S = 5/2 = 2.5
      ▪ Using Amdahl’s Law:
          $$\text{Speedup}(E) = \frac{1}{(1 - F) + \frac{F}{S}} = \frac{1}{.55 + \frac{.45}{2.5}} = 1.37$$
  • 46. Example 1: Alternative solution
      ▪ Same instruction mix as above (CPI = 2.2, T = I × CPI × C)
          Op       Freq   Cycles   CPI(i)   % Time
          ALU      50%    1        .5       23%
          Load     20%    5        1.0      45%
          Store    10%    3        .3       14%
          Branch   20%    2        .4       18%
      ▪ If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement?
          Old CPI = 2.2
          New CPI = .5 × 1 + .2 × 2 + .1 × 3 + .2 × 2 = 1.6 (the CPI of load is now 2 instead of 5)
          $$\text{Speedup}(E) = \frac{\text{Original Execution Time}}{\text{New Execution Time}} = \frac{\text{Instruction count} \times \text{old CPI} \times \text{clock cycle}}{\text{Instruction count} \times \text{new CPI} \times \text{clock cycle}} = \frac{\text{old CPI}}{\text{new CPI}} = \frac{2.2}{1.6} = 1.37$$
      ▪ This is the same speedup obtained from Amdahl’s Law in the first solution.
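Both solutions of Example 1 can be checked from the instruction mix table. The sketch below uses an illustrative dictionary `MIX` of (frequency, cycles) pairs and computes the speedup once through Amdahl's Law and once through the CPI ratio.

```python
# Instruction mix from the table: op -> (frequency, cycles per instruction).
MIX = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}


def average_cpi(mix) -> float:
    """Average CPI = sum over ops of frequency x cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


old_cpi = average_cpi(MIX)

# Route 1: Amdahl's Law, with F = share of execution time spent in loads.
load_freq, load_cycles = MIX["Load"]
fraction = load_freq * load_cycles / old_cpi  # about 0.45
factor = load_cycles / 2                      # loads go from 5 cycles to 2
amdahl = 1.0 / ((1.0 - fraction) + fraction / factor)

# Route 2: recompute the CPI with the faster load and take old CPI / new CPI.
new_cpi = average_cpi(dict(MIX, Load=(load_freq, 2)))
cpi_ratio = old_cpi / new_cpi

print(f"old CPI = {old_cpi:.1f}, new CPI = {new_cpi:.1f}")
print(f"Amdahl speedup = {amdahl:.3f}, CPI-ratio speedup = {cpi_ratio:.3f}")
```

With the unrounded load fraction both routes give 1.375; rounding F to .45, as on the slide, gives 1.37.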
  • 47. Example 2
      ▪ A program runs in 100 seconds on a machine (Machine = CPU), with multiply operations responsible for 80 seconds of this time. By how much must the speed of multiplication be improved to make the program four times faster?
          Desired speedup = 4 = 100 / (Execution time with enhancement)
          Execution time with enhancement = 100/4 = 25 seconds
          25 seconds = (100 - 80) seconds + 80 seconds / S
          25 seconds = 20 seconds + 80 seconds / S
          5 seconds = 80 seconds / S
          S = 80/5 = 16
      ▪ Hence multiplication should be 16 times faster to get an overall speedup of 4.
      ▪ Alternatively, find the enhanced fraction of execution time, F = 80/100 = .8, and then solve Amdahl’s speedup equation for the desired enhancement factor S:
          Speedup(E) = 4 = 1 / ((1 - F) + F/S) = 1 / ((1 - .8) + .8/S) = 1 / (.2 + .8/S)
          Solving for S gives S = 16
  • 48. Example 3
      ▪ For the previous example, with a program running in 100 seconds on a machine and multiply operations responsible for 80 seconds of this time: by how much must the speed of multiplication be improved to make the program five times faster?
          Desired speedup = 5 = 100 / (Execution time with enhancement)
          Execution time with enhancement = 100/5 = 20 seconds
          20 seconds = (100 - 80) seconds + 80 seconds / S
          20 seconds = 20 seconds + 80 seconds / S
          0 = 80 seconds / S
      ▪ No amount of multiplication speed improvement can achieve this: 80/S is positive for every finite S, so the overall speedup always stays below 100/20 = 5.
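Examples 2 and 3 both solve the same equation for S, so a single helper covers them. The function name `required_factor` and its convention of returning `None` when the target is unreachable are illustrative choices, not something from the slides.

```python
from typing import Optional


def required_factor(total_s: float, enhanced_s: float, target_speedup: float) -> Optional[float]:
    """Factor S by which the enhanced part must speed up, or None if impossible."""
    target_time = total_s / target_speedup  # execution time we must reach
    unaffected = total_s - enhanced_s       # time the enhancement cannot touch
    budget = target_time - unaffected       # time left for the enhanced part
    if budget <= 0:
        return None                         # even an infinite S is not enough
    return enhanced_s / budget


print(required_factor(100, 80, 4))  # 16.0
print(required_factor(100, 80, 5))  # None
```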
  • 49. Multiple Enhancements
      ▪ Suppose that enhancement E_i accelerates a fraction F_i of the original execution time by a factor S_i and the remainder of the time is unaffected (n enhancements, i = 1, 2, …, n, each affecting a different portion of execution time); then:
          $$\text{Speedup} = \frac{\text{Original Execution Time}}{\left(\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}\right) \times \text{Original Execution Time}}$$
          $$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$
      ▪ Here (1 - Σ_i F_i) is the unaffected fraction
      ▪ Note: all fractions F_i refer to the original execution time, before the enhancements are applied
      ▪ What if the fractions given are after the enhancements were applied? How would you solve the problem? (i.e., find an expression for speedup)
  • 50. Multiple Enhancements Example
      ▪ Three CPU performance enhancements are proposed with the following speedups and percentages of the code execution time affected:
          Speedup1 = S1 = 10    Percentage1 = F1 = 20%
          Speedup2 = S2 = 15    Percentage2 = F2 = 15%
          Speedup3 = S3 = 30    Percentage3 = F3 = 10%
      ▪ While all three enhancements are in place in the new design, each enhancement affects a different portion of the code and only one enhancement can be used at a time.
      ▪ What is the resulting overall speedup?
          $$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$
          Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30] = 1 / [.55 + .0333] = 1 / .5833 = 1.71
  • 51. Pictorial Depiction of Example
      ▪ Before: execution time with no enhancements: 1 (i.e., normalized to 1)
          [ Unaffected, fraction: .55 | F1 = .2 | F2 = .15 | F3 = .1 ]
          S1 = 10 (/ 10), S2 = 15 (/ 15), S3 = 30 (/ 30)
      ▪ After: execution time with enhancements: .55 + .02 + .01 + .00333 = .5833
          [ Unaffected, fraction: .55 (unchanged) | .02 | .01 | .00333 ]
      ▪ Speedup = 1 / .5833 = 1.71
      ▪ Note: all fractions F_i refer to the original execution time.
      ▪ What if the fractions given are after the enhancements were applied? How would you solve the problem?
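The three-enhancement example can be verified with a short function that takes a list of (F_i, S_i) pairs. The name `multi_speedup` is illustrative, and the sketch assumes, as the slides do, that the fractions refer to the original execution time and do not overlap.

```python
def multi_speedup(enhancements) -> float:
    """Speedup = 1 / ((1 - sum F_i) + sum F_i / S_i) for (F_i, S_i) pairs."""
    total_fraction = sum(f for f, _ in enhancements)
    if total_fraction > 1.0:
        raise ValueError("fractions of the original time cannot exceed 1")
    unaffected = 1.0 - total_fraction
    enhanced = sum(f / s for f, s in enhancements)
    return 1.0 / (unaffected + enhanced)


speedup = multi_speedup([(0.20, 10), (0.15, 15), (0.10, 30)])
print(f"{speedup:.2f}")  # 1.71
```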
  • 52. Summary
      ▪ Time and performance: Machine A is n times faster than Machine B
          – iff Time(B)/Time(A) = n
      ▪ Performance (as execution time) = Time/Program:
          $$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
          (code size) × (CPI) × (cycle time)
  • 53. Summary Cont’d
      ▪ Other Metrics: MIPS and MFLOPS
          – Beware of peak and omitted details
      ▪ Benchmarks: SPEC2000
      ▪ Amdahl’s Law:
          $$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{s}}$$