Evaluating Computers:
Bigger, better, faster, more?
What do you want in a computer?
• Low latency -- one unit of work in minimum time
• 1/latency = responsiveness
• High throughput -- maximum work per time
• High bandwidth (BW)
• Low cost
• Low power -- minimum joules per time
• Low energy -- minimum joules per work
• Reliability -- Mean time to failure (MTTF)
• Derived metrics
• responsiveness/dollar
• BW/$
• BW/Watt
• Work/Joule
• Energy * latency -- Energy delay product
• MTTF/$
Latency
• This is the simplest kind of performance
• How long does it take the computer to perform
a task?
• The task at hand depends on the situation.
• Usually measured in seconds
• Also measured in clock cycles
• Caution: if you are comparing two different systems, you
must ensure that the cycle times are the same.
Measuring Latency
• Stop watch!
• System calls
• gettimeofday()
• System.currentTimeMillis()
• Command line
• time <command>
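The system-call approach can be sketched in Python as follows. This is an illustration, not the lecture's own code: it swaps in `time.perf_counter()` for `gettimeofday()` or `System.currentTimeMillis()`, and the `measure_latency` helper and the summing workload are hypothetical names chosen for the example.

```python
import time

def measure_latency(task, *args):
    """Return (result, elapsed_seconds) for one call to task(*args)."""
    start = time.perf_counter()            # high-resolution monotonic clock
    result = task(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical workload: sum the first million integers.
result, seconds = measure_latency(sum, range(1_000_000))
print(f"result={result}, latency={seconds:.6f} s")
```

A single measurement is noisy, so in practice one would usually repeat the call several times and look at the spread.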
Where latency matters
• Application responsiveness
• Any time a person is waiting.
• GUIs
• Games
• Internet services (from the user's perspective)
• “Real-time” applications
• Tight constraints enforced by the real world
• Anti-lock braking systems
• Manufacturing control
• Multi-media applications
• The cost of poor latency
• If you are selling computer time, latency is money.
Latency and Performance
• By definition:
• Performance = 1/Latency
• If Performance(X) > Performance(Y), X is faster.
• If Perf(X)/Perf(Y) = S, X is S times faster than Y.
• Equivalently: Latency(Y)/Latency(X) = S
• When we need to talk about other kinds of
“performance”, we must be more specific.
The Performance Equation
• We would like to model how architecture impacts
performance (latency)
• This means we need to quantify performance in
terms of architectural parameters.
• Instructions -- the basic unit of work for a
processor
• Cycles and cycle time -- together these two give us a notion of time.
• The first fundamental theorem of computer
architecture:
Latency = Instructions * Cycles/Instruction *
Seconds/Cycle
The Performance Equation
• The units work out! Remember your
dimensional analysis!
• Cycles/Instruction == CPI
• Seconds/Cycle == 1/(clock rate in Hz)
• Example:
• 1GHz clock
• 1 billion instructions
• CPI = 4
• What is the latency?
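The example above can be worked through directly from the performance equation. The sketch below is only an illustration; the function name `latency_seconds` is made up for this example.

```python
def latency_seconds(instructions, cpi, clock_hz):
    """Latency = Instructions * Cycles/Instruction * Seconds/Cycle."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cpi * seconds_per_cycle

# Values from the slide: 1 billion instructions, CPI = 4, 1 GHz clock.
example = latency_seconds(instructions=1e9, cpi=4, clock_hz=1e9)
print(f"latency is about {example:.1f} s")
```

One billion instructions at 4 cycles each is 4 billion cycles, and at 1 ns per cycle that is about 4 seconds.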
Examples
• gcc runs in 100 sec on a 1 GHz machine
– How many cycles does it take? 100G cycles
• gcc runs in 75 sec on a 600 MHz machine
– How many cycles does it take? 45G cycles
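The two cycle counts follow from cycles = seconds * clock rate. A small sketch, with a hypothetical helper name, makes the arithmetic explicit:

```python
def cycle_count(seconds, clock_hz):
    """Total cycles = run time in seconds * cycles per second."""
    return seconds * clock_hz

cycles_1ghz = cycle_count(100, 1_000_000_000)    # 100 s at 1 GHz
cycles_600mhz = cycle_count(75, 600_000_000)     # 75 s at 600 MHz
print(cycles_1ghz, cycles_600mhz)
```

The slower-clocked machine needs fewer than half as many cycles, which is the puzzle the next slide takes up.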
How can this be?
• Different Instruction count?
• Different ISAs ?
• Different compilers ?
• Different CPI?
• underlying machine implementation
• Microarchitecture
• Different cycle time?
• New process technology
• Microarchitecture
Computing Average CPI
• Instruction execution time depends on instruction
type (we’ll get into why this is so later on)
• Integer +, -, <<, |, & -- 1 cycle
• Integer *, /, -- 5-10 cycles
• Floating point +, - -- 3-4 cycles
• Floating point *, /, sqrt() -- 10-30 cycles
• Loads/stores -- variable
• All these values depend on the particular implementation,
not the ISA
• Total CPI depends on the workload’s Instruction mix
-- how many of each type of instruction executes
• What program is running?
• How was it compiled?
The Compiler’s Role
• Compilers affect CPI…
• Wise instruction selection
• “Strength reduction”: x*2^n -> x << n
• Use registers to eliminate loads and stores
• More compact code -> less waiting for instructions
• …and instruction count
• Common sub-expression elimination
• Use registers to eliminate loads and stores
Stupid Compiler
int i, sum = 0;
for (i = 0; i < 10; i++)
    sum += i;

      sw   $0, 0($sp)      # sum = 0
      sw   $0, 4($sp)      # i = 0
loop:
      lw   $1, 4($sp)      # load i
      addi $3, $1, -10     # $3 = i - 10
      beq  $3, $0, end     # exit when i == 10
      lw   $2, 0($sp)      # load sum
      add  $2, $2, $1      # sum += i
      sw   $2, 0($sp)      # store sum
      addi $1, $1, 1       # i++
      sw   $1, 4($sp)      # store i
      b    loop
end:

Type    CPI   Static #   Dyn #
mem     5     6          42
int     1     3          30
br      1     2          20
Total   2.8   11         92
(5*42 + 1*30 + 1*20)/92 = 2.8
Smart Compiler
int i, sum = 0;
for (i = 0; i < 10; i++)
    sum += i;

      add  $1, $0, $0      # i = 0
      add  $2, $0, $0      # sum = 0
loop:
      addi $3, $1, -10     # $3 = i - 10
      beq  $3, $0, end     # exit when i == 10
      add  $2, $2, $1      # sum += i
      addi $1, $1, 1       # i++
      b    loop
end:
      sw   $2, 0($sp)      # store sum once, after the loop

Type    CPI    Static #   Dyn #
mem     5      1          1
int     1      5          32
br      1      2          20
Total   1.08   8          53
(5*1 + 1*32 + 1*20)/53 = 1.08
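The average CPI in both tables is a weighted average over the dynamic instruction mix. The sketch below recomputes it from the per-type CPI and dynamic counts listed above; the function and variable names are hypothetical, chosen only for this illustration.

```python
def average_cpi(mix):
    """mix maps instruction type -> (cycles per instruction, dynamic count).

    Returns (average CPI, total cycles).
    """
    total_cycles = sum(cpi * count for cpi, count in mix.values())
    total_instructions = sum(count for _, count in mix.values())
    return total_cycles / total_instructions, total_cycles

# Dynamic counts from the two tables above.
unoptimized = {"mem": (5, 42), "int": (1, 30), "br": (1, 20)}
optimized = {"mem": (5, 1), "int": (1, 32), "br": (1, 20)}

cpi_unopt, cycles_unopt = average_cpi(unoptimized)
cpi_opt, cycles_opt = average_cpi(optimized)
print(f"unoptimized: CPI {cpi_unopt:.2f}, {cycles_unopt} cycles")
print(f"optimized:   CPI {cpi_opt:.2f}, {cycles_opt} cycles")
```

Keeping variables in registers lowers both the instruction count (92 to 53) and the CPI, so the total cycle count drops by more than either factor alone.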
Live demo
Program inputs affect CPI too!
int rand[1000] = { /* random 0s and 1s */ };
for (i = 0; i < 1000; i++)
    if (rand[i]) sum -= i;
    else sum *= i;

int ones[1000] = { /* all 1s: 1, 1, ... */ };
for (i = 0; i < 1000; i++)
    if (ones[i]) sum -= i;
    else sum *= i;
• Data-dependent computation
• Data-dependent micro-architectural behavior
–Processors are faster when the computation is
predictable (more later)
Live demo
Making Meaningful Comparisons
• Meaningful CPI exists only:
• For a particular program with a particular compiler
• ...with a particular input.
• You MUST consider all 3 to get accurate latency estimations
or machine speed comparisons
• Instruction Set
• Compiler
• Implementation of Instruction Set (386 vs Pentium)
• Processor Freq (600 MHz vs 1 GHz)
• Same high level program with same input
• “Wall clock” measurements are always comparable
if the workloads (app + inputs) are the same.
The Performance Equation
• Clock rate =
• Instruction count =
• Latency =
• Find the CPI!
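Solving the performance equation for CPI gives CPI = Latency * Clock rate / Instruction count. The slide leaves the three inputs blank, so the numbers in the sketch below are hypothetical placeholders rather than the values used in the lecture.

```python
def cpi_from_measurements(latency_s, clock_hz, instruction_count):
    """Rearranged performance equation: CPI = latency * clock rate / instructions."""
    return latency_s * clock_hz / instruction_count

# Hypothetical inputs (not from the slide): 2 s run, 2 GHz clock, 1e9 instructions.
cpi = cpi_from_measurements(latency_s=2, clock_hz=2_000_000_000,
                            instruction_count=1_000_000_000)
print(cpi)
```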
Today
• DRAM
• Quiz 1 recap
• HW 1 recap
• Questions about ISAs
• More about the project?
• Amdahl’s law
Key Points
• Amdahl’s law and how to apply it in a variety of
situations
• Its role in guiding optimization of a system
• Its role in determining the impact of localized
changes on the entire system
Limits on Speedup: Amdahl’s Law
• “The fundamental theorem of performance
optimization”
• Coined by Gene Amdahl (one of the designers of the
IBM 360)
• Optimizations do not (generally) uniformly affect the
entire program
– The more widely applicable a technique is, the more valuable it
is
– Conversely, limited applicability can (drastically) reduce the
impact of an optimization.
Always heed Amdahl’s Law!!!
It is central to many optimization problems
Amdahl’s Law in Action
• SuperJPEG-O-Rama2000 ISA extensions**
–Speeds up JPEG decode by 10x!!!
–Act now! While Supplies Last!
** Increases processor cost by 45%
Amdahl’s Law in Action
• SuperJPEG-O-Rama2000 in the wild
• PictoBench spends 33% of its time doing
JPEG decode
• How much does JOR2k help?
• Without JOR2k: 30s
• With JOR2k: 21s
Performance: 30/21 = 1.4x Speedup != 10x
Is this worth the 45% increase in cost?
Amdahl ate our Speedup!
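The 30s and 21s figures can be reproduced from the slide's own numbers. The sketch below is a plain arithmetic illustration; the variable names are made up, and the performance-per-cost ratio at the end is only one possible way to frame the cost question, not something the slide states.

```python
# Figures from the slide: 30 s total, 33% in JPEG decode, 10x faster decode,
# 45% higher processor cost.
total_before_s = 30.0
jpeg_fraction = 0.33
jpeg_speedup = 10.0
cost_increase = 0.45

jpeg_before_s = total_before_s * jpeg_fraction
other_s = total_before_s - jpeg_before_s            # untouched by the extension
total_after_s = other_s + jpeg_before_s / jpeg_speedup

overall_speedup = total_before_s / total_after_s
# Assumption for illustration: weigh the gain as performance per unit of cost.
perf_per_cost = overall_speedup / (1.0 + cost_increase)

print(f"after: {total_after_s:.2f} s, speedup: {overall_speedup:.2f}x, "
      f"perf/cost vs. baseline: {perf_per_cost:.2f}")
```

Under that framing the ratio comes out slightly below 1, meaning the extension buys less performance than it costs for this workload.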
Amdahl’s Law
• The second fundamental theorem of computer
architecture.
• If we can speed up a fraction x of the program by S times
• Amdahl’s Law gives the total speedup, Stot
Stot = 1 / (x/S + (1-x))
Sanity check:
x = 1 => Stot = 1 / (1/S + (1-1)) = 1 / (1/S) = S
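Amdahl's Law as stated above translates into a one-line function. The sketch below uses a hypothetical name, `amdahl_speedup`, and checks the two boundary cases: x = 1 (the whole program is sped up) and x = 0 (nothing is).

```python
import math

def amdahl_speedup(x, s):
    """Total speedup when a fraction x of execution time is sped up by s."""
    return 1.0 / (x / s + (1.0 - x))

# Sanity checks: x = 1 gives s itself, x = 0 gives no speedup at all.
whole_program = amdahl_speedup(1.0, 7.0)
nothing = amdahl_speedup(0.0, 7.0)
print(math.isclose(whole_program, 7.0), math.isclose(nothing, 1.0))

# The JPEG example: 33% of the time sped up 10x.
print(f"{amdahl_speedup(0.33, 10.0):.2f}x")
```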
Amdahl’s Corollary #1
• Maximum possible speedup, Smax (as S -> infinity)
Smax = 1 / (1-x)
Amdahl’s Law Practice
• Protein String Matching Code
–200 hours to run on current machine, spends 20% of
time doing integer instructions
–How much faster must you make the integer unit to
make the code run 10 hours faster?
–How much faster must you make the integer unit to
make the code run 50 hours faster?
A) 1.1
B) 1.25
C) 1.75
D) 1.33
E) 10.0
F) 50.0
G) 1 million times
H) Other
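One way to approach the practice question is to solve for the speedup S that the integer unit needs. The sketch below is an illustration with a hypothetical helper; it returns `None` when the requested saving is at least as large as the time spent in the integer unit, since no finite speedup can deliver that.

```python
import math

def required_speedup(total_time, fraction, time_saved):
    """Speedup needed on `fraction` of the run to cut `time_saved` off the total.

    Returns None when no finite speedup is enough.
    """
    part = total_time * fraction          # time spent in the part being sped up
    if time_saved >= part:
        return None
    return part / (part - time_saved)

ten_hours = required_speedup(200, 0.20, 10)     # 40 h of integer work -> 30 h
fifty_hours = required_speedup(200, 0.20, 50)   # only 40 h available to remove
print(ten_hours, fifty_hours)
```

With these inputs the first case works out to about 1.33 and the second is unreachable, because integer instructions account for only 40 of the 200 hours.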
Amdahl’s Law Practice
• Protein String Matching Code
–4 days execution time (ET) on current machine
• 20% of time doing integer instructions
• 35% of time doing I/O
–Which is the better economic tradeoff?
• Compiler optimization that reduces the number of
integer instructions by 25% (assume each integer
instruction takes the same amount of time)
• Hardware optimization that makes I/O run 20%
faster
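The two options can be compared by applying Amdahl's Law to each. The sketch below rests on two stated assumptions: removing 25% of integer instructions leaves that part at 0.75 of its time (a speedup of 1/0.75), and "20% faster" I/O means a speedup of 1.2 on the I/O portion. The slide gives no cost figures, so the economic side of the question is left open.

```python
def amdahl_speedup(x, s):
    """Total speedup when a fraction x of execution time is sped up by s."""
    return 1.0 / (x / s + (1.0 - x))

total_hours = 4 * 24                                   # 4 days of execution time

# Assumption: 25% fewer integer instructions -> integer time scales by 0.75.
compiler_speedup = amdahl_speedup(0.20, 1.0 / 0.75)
# Assumption: "20% faster" I/O is read as a 1.2x speedup of the I/O portion.
hardware_speedup = amdahl_speedup(0.35, 1.2)

compiler_hours_saved = total_hours * (1 - 1 / compiler_speedup)
hardware_hours_saved = total_hours * (1 - 1 / hardware_speedup)
print(f"compiler: {compiler_speedup:.3f}x, saves {compiler_hours_saved:.1f} h")
print(f"hardware: {hardware_speedup:.3f}x, saves {hardware_hours_saved:.1f} h")
```

Under these assumptions the two gains are close, so the relative cost of each change would likely decide the tradeoff.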
Amdahl’s Law Applies All Over
• SSDs use 10x less power than HDs
• But they only save you ~50% overall.
Amdahl’s Law in Memory
[Figure: memory device block diagram. Labels: Address, high order bits, low order bits, row decoder, column decoder, storage array, sense amps, Data]
• Storage array 90% of area
• Row decoder 4%
• Column decoder 2%
• Sense amps 4%
• What’s the benefit of
reducing bit size by 10%?
• Reducing column decoder
size by 90%?
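The same reasoning applies to chip area instead of time. The sketch below, using hypothetical names, scales one component of the area breakdown above and reports the resulting total as a fraction of the original.

```python
# Area breakdown from the slide, as fractions of total chip area.
areas = {"storage array": 0.90, "row decoder": 0.04,
         "column decoder": 0.02, "sense amps": 0.04}

def total_area_after(areas, component, reduction):
    """Total area (relative to 1.0) after shrinking one component by `reduction`."""
    return sum(a * (1.0 - reduction) if name == component else a
               for name, a in areas.items())

smaller_bits = total_area_after(areas, "storage array", 0.10)
smaller_decoder = total_area_after(areas, "column decoder", 0.90)
print(f"10% smaller bits:           {smaller_bits:.3f} of original area")
print(f"90% smaller column decoder: {smaller_decoder:.3f} of original area")
```

A modest 10% improvement to the storage array saves about 9% of the area, while a drastic 90% cut to the column decoder saves under 2%.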
Amdahl’s Corollary #2
• Make the common case fast (i.e., x should be
large)!
–Common == “most time consuming” not necessarily
“most frequent”
–The uncommon case doesn’t make much difference
–Be sure of what the common case is
–The common case changes.
• Repeat…
–With optimization, the common becomes uncommon
and vice versa.
Amdahl’s Corollary #2: Example
Common case
7x => 1.4x
4x => 1.3x
1.3x => 1.1x
Total = 20/10 = 2x
• In the end, there is no common case!
• Options:
– Global optimizations (faster clock, better compiler)
– Find something common to work on (e.g., memory latency)
– War of attrition
– Total redesign (You are probably well-prepared for this)
Amdahl’s Corollary #3
• Benefits of parallel processing
• p processors
• A fraction x of the program is p-way parallelizable
• Maximum speedup, Spar
Spar = 1 / (x/p + (1-x))
x is pretty small for desktop applications, even for p = 2
Does Intel’s 80-core processor make much sense?
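To get a feel for the 80-core question, the sketch below evaluates Spar for a few processor counts. The parallel fraction x = 0.5 is a hypothetical value chosen for illustration; the slide does not give one.

```python
def parallel_speedup(x, p):
    """Maximum speedup when a fraction x of the program runs p-way parallel."""
    return 1.0 / (x / p + (1.0 - x))

x = 0.5                       # hypothetical parallel fraction, not from the slide
limit = 1.0 / (1.0 - x)       # bound as p grows without limit
for p in (2, 4, 80):
    print(f"p={p:2d}: {parallel_speedup(x, p):.2f}x (limit {limit:.1f}x)")
```

With half the program serial, going from 2 to 80 processors moves the speedup only from about 1.33x toward the 2x limit.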
Amdahl’s Corollary #4
• Amdahl’s law for latency (L)
Lnew = Lbase * 1/Speedup
Lnew = Lbase * (x/S + (1-x))
Lnew = (Lbase/S)*x + Lbase*(1-x)
• If you can speed up a fraction y of the remaining (1-x), you can apply
Amdahl’s law recursively
Lnew = (Lbase/S1)*x +
(Lbase*(1-x)*y/S2 + Lbase*(1-x)*(1-y))
• This is how we will analyze memory system performance
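The latency form and its recursive application can be sketched as two small functions. The names and the sample numbers are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math

def new_latency(l_base, x, s):
    """Lnew = (Lbase/S)*x + Lbase*(1-x)."""
    return (l_base / s) * x + l_base * (1.0 - x)

def new_latency_nested(l_base, x, s1, y, s2):
    """Speed up fraction x by s1, then fraction y of the remaining (1-x) by s2."""
    sped_up = (l_base / s1) * x
    remainder = l_base * (1.0 - x)
    return sped_up + (remainder / s2) * y + remainder * (1.0 - y)

# Hypothetical numbers: 100 s baseline, half sped up 2x,
# then half of what remains sped up 5x.
nested = new_latency_nested(100.0, 0.5, 2.0, 0.5, 5.0)   # 25 + 5 + 25
print(nested)
```

Setting y = 0 in the nested form reduces it to the single-level formula, which is a quick consistency check.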
Amdahl’s Non-Corollary
• Amdahl’s law does not bound slowdown
Lnew = (Lbase /S)*x + Lbase*(1-x)
• Lnew is linear in 1/S
• Example: x = 0.01 of execution, Lbase = 1
–S = 0.001;
• Lnew = 1000*Lbase*0.01 + Lbase*(0.99) ~ 10*Lbase
–S = 0.00001;
• Lnew = 100000*Lbase*0.01 + Lbase*(0.99) ~ 1000*Lbase
• Things can only get so fast, but they can get
arbitrarily slow.
–Do not hurt the non-common case too much!
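The slowdown example can be checked numerically with the same latency formula. This is a sketch with a hypothetical function name, using the slide's x = 0.01 and Lbase = 1.

```python
import math

def new_latency(l_base, x, s):
    """Lnew = (Lbase/S)*x + Lbase*(1-x); S < 1 means that part got slower."""
    return (l_base / s) * x + l_base * (1.0 - x)

slow = new_latency(1.0, 0.01, 0.001)       # 1% of the program, 1000x slower
slower = new_latency(1.0, 0.01, 0.00001)   # 1% of the program, 100000x slower
print(f"{slow:.2f} {slower:.2f}")
```

Even though only 1% of execution is affected, the total latency grows by roughly 11x and 1000x respectively.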
Benchmarks: Standard Candles for
Performance
• It’s hard to convince manufacturers to run your program
(unless you’re a BIG customer)
• A benchmark is a set of programs that are representative of a
class of problems.
• To increase predictability, collections of benchmark
applications, called benchmark suites, are popular
– “Easy” to set up
– Portable
– Well-understood
– Stand-alone
– Standardized conditions
– These are all things that real software is not.
Classes of benchmarks
• Microbenchmark – measure one feature of system
– e.g. memory accesses or communication speed
• Kernels – most compute-intensive part of applications
– e.g. Linpack and NAS kernel benchmarks (for supercomputers)
• Full application:
– SpecInt / SpecFP (int and float) (for Unix workstations)
– Other suites for databases, web servers, graphics,...
Bandwidth
• The amount of work (or data) per time
• MB/s, GB/s -- network BW, disk BW, etc.
• Frames per second -- Games, video transcoding
• (why are games under both latency and BW?)
• Also called “throughput”
Measuring Bandwidth
• Measure how much work is done
• Measure latency
• Divide
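The three steps above can be sketched in Python. The workload here (hashing a buffer with SHA-256) and the helper name `measure_bandwidth` are hypothetical stand-ins chosen for illustration; any task with a measurable amount of work would do.

```python
import hashlib
import time

def measure_bandwidth(task, data):
    """Return bytes processed per second for one call to task(data)."""
    work_done = len(data)                      # 1. measure how much work is done
    start = time.perf_counter()
    task(data)
    latency = time.perf_counter() - start      # 2. measure latency
    return work_done / latency                 # 3. divide

buffer = bytes(10_000_000)                     # 10 MB of zero bytes
bandwidth = measure_bandwidth(lambda d: hashlib.sha256(d).digest(), buffer)
print(f"{bandwidth / 1e6:.1f} MB/s")
```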
Latency-BW Trade-offs
• Often, increasing latency for one task can
increase BW for many tasks.
• Think of waiting in line for one of 4 bank tellers
• If the line is empty, your response time is minimized, but
throughput is low because utilization is low.
• If there is always a line, you wait longer (your latency
goes up), but there is always work available for tellers.
• Much of computer performance is about
scheduling work onto resources
• Network links.
• Memory ports.
• Processors, functional units, etc.
• IO channels.
• Increasing contention for these resources generally
increases throughput but hurts latency.
Stationwagon Digression
• IPv6 Internet 2: 272,400 terabit-meters per second
–585 GB in 30 minutes over 30,000 km
–9.08 Gb/s
• Subaru Outback wagon
– Max load = 408 kg
– 21 mpg
• MHX2 BT 300 laptop drive
– 300 GB/drive
– 0.135 kg
• 906 TB
• Legal speed: 75 mph (33.3 m/s)
• BW = 8.2 Gb/s
• Latency = 10 days
• 241,535 terabit-meters per second
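The station wagon figures can be roughly reproduced from the stated load, drive weight, capacity, distance, and speed. The sketch below is an illustration under two assumptions: 1 GB is taken as 10^9 bytes, and only whole drives are loaded. The function name is hypothetical, and small differences from the slide's rounded numbers are to be expected.

```python
import math

def sneakernet(max_load_kg, drive_kg, drive_gb, distance_m, speed_m_per_s):
    """Return (drive count, latency in seconds, bandwidth in bits/s)."""
    drives = math.floor(max_load_kg / drive_kg)        # whole drives only
    bits = drives * drive_gb * 1e9 * 8                 # assumes 1 GB = 1e9 bytes
    latency_s = distance_m / speed_m_per_s
    return drives, latency_s, bits / latency_s

drives, latency_s, bandwidth_bps = sneakernet(
    max_load_kg=408, drive_kg=0.135, drive_gb=300,
    distance_m=30_000_000, speed_m_per_s=33.3)

print(f"{drives} drives, {latency_s / 86400:.1f} days, "
      f"{bandwidth_bps / 1e9:.2f} Gb/s, "
      f"{bandwidth_bps * 30_000_000 / 1e12:,.0f} terabit-meters per second")
```

The result lands near 8 Gb/s with a latency of a little over 10 days: bandwidth comparable to the network record, at a latency many orders of magnitude worse.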
Prius Digression
• IPv6 Internet 2: 272,400 terabit-meters per second
–585 GB in 30 minutes over 30,000 km
–9.08 Gb/s
• My Toyota Prius
– Max load = 374 kg
– 44 mpg (2x power efficiency)
• MHX2 BT 300
– 300 GB/drive
– 0.135 kg
• 831 TB
• Legal speed: 75 mph (33.3 m/s)
• BW = 7.5 Gb/s
• Latency = 10 days
• 221,407 terabit-meters per second (13%
performance hit)

YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
Dust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSEDust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSE
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
The Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsThe Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World Politics
 

03 performance

  • 2. What do you want in a computer?
  • 3. What do you want in a computer? • Low latency -- one unit of work in minimum time • 1/latency = responsiveness • High throughput -- maximum work per time • High bandwidth (BW) • Low cost • Low power -- minimum joules per time • Low energy -- minimum joules per work • Reliability -- mean time to failure (MTTF) • Derived metrics • responsiveness/dollar • BW/$ • BW/Watt • Work/Joule • Energy * latency -- energy-delay product • MTTF/$
  • 4. Latency • This is the simplest kind of performance • How long does it take the computer to perform a task? • The task at hand depends on the situation. • Usually measured in seconds • Also measured in clock cycles • Caution: if you are comparing two different systems in clock cycles, you must ensure that the cycle times are the same.
  • 5. Measuring Latency • Stopwatch! • System calls • gettimeofday() • System.currentTimeMillis() • Command line • time <command>
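A minimal sketch of the "system call" approach in Python, assuming `time.perf_counter()` as the clock (the slide names `gettimeofday()` and `System.currentTimeMillis()`; this is the analogous call, not one from the slides). The helper `measure_latency` and the workload `sample_task` are hypothetical names introduced for illustration.

```python
import time

def measure_latency(task, repeats=5):
    """Return the best-of-N wall-clock latency of task() in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()      # monotonic, high-resolution clock
        task()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)        # keep the least-disturbed run
    return best

def sample_task():
    # Hypothetical workload: sum the first 100,000 integers.
    return sum(range(100_000))

latency = measure_latency(sample_task)
print(f"latency: {latency:.6f} s")
```

Taking the minimum over several runs is one common way to reduce noise from other activity on the machine; reporting the mean or median is an equally reasonable choice.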
  • 6. Where latency matters • Application responsiveness • Any time a person is waiting. • GUIs • Games • Internet services (from the user’s perspective) • “Real-time” applications • Tight constraints enforced by the real world • Anti-lock braking systems • Manufacturing control • Multi-media applications • The cost of poor latency • If you are selling computer time, latency is money.
  • 7. Latency and Performance • By definition: • Performance = 1/Latency • If Performance(X) > Performance(Y), X is faster. • If Perf(X)/Perf(Y) = S, X is S times faster than Y. • Equivalently: Latency(Y)/Latency(X) = S • When we need to talk about other kinds of “performance”, we must be more specific.
  • 8. The Performance Equation • We would like to model how architecture impacts performance (latency) • This means we need to quantify performance in terms of architectural parameters. • Instructions -- this is the basic unit of work for a processor • Cycles and cycle time -- these two give us a notion of time. • The first fundamental theorem of computer architecture: Latency = Instructions * Cycles/Instruction * Seconds/Cycle
  • 9. The Performance Equation • Latency = Instructions * Cycles/Instruction * Seconds/Cycle • The units work out! Remember your dimensional analysis! • Cycles/Instruction == CPI • Seconds/Cycle == 1/Hz (the reciprocal of the clock rate) • Example: • 1 GHz clock • 1 billion instructions • CPI = 4 • What is the latency?
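The example on this slide can be worked directly from the performance equation. The sketch below is illustrative only; the function name `latency_seconds` is an assumption, not something from the slides.

```python
def latency_seconds(instructions, cpi, clock_hz):
    """Latency = Instructions * Cycles/Instruction * Seconds/Cycle."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cpi * seconds_per_cycle

# Slide example: 1 GHz clock, 1 billion instructions, CPI = 4.
example_latency = latency_seconds(instructions=1e9, cpi=4, clock_hz=1e9)
print(f"{example_latency:.1f} s")
```

With these inputs the program executes 4 billion cycles at 1 ns each, so the latency is 4 seconds.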
  • 10. Examples • gcc runs in 100 sec on a 1 GHz machine – How many cycles does it take? 100G cycles • gcc runs in 75 sec on a 600 MHz machine – How many cycles does it take? 45G cycles
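Both answers follow from rearranging the performance equation to cycles = seconds * cycles/second. A small sketch, with `cycles_executed` as a hypothetical helper name:

```python
def cycles_executed(run_time_s, clock_hz):
    """Total cycles = run time (s) * clock rate (cycles/s)."""
    return run_time_s * clock_hz

fast_clock = cycles_executed(100, 1_000_000_000)   # 100 s at 1 GHz
slow_clock = cycles_executed(75, 600_000_000)      # 75 s at 600 MHz

print(fast_clock // 10**9, "G cycles")
print(slow_clock // 10**9, "G cycles")
```

The 600 MHz machine finishes sooner despite its slower clock because it needs fewer than half as many cycles for the same job.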
  • 11. How can this be? • Different instruction count? • Different ISAs? • Different compilers? • Different CPI? • Underlying machine implementation • Microarchitecture • Different cycle time? • New process technology • Microarchitecture
  • 12. Computing Average CPI • Instruction execution time depends on instruction type (we’ll get into why this is so later on) • Integer +, -, <<, |, & -- 1 cycle • Integer *, / -- 5-10 cycles • Floating point +, - -- 3-4 cycles • Floating point *, /, sqrt() -- 10-30 cycles • Loads/stores -- variable • All these values depend on the particular implementation, not the ISA • Total CPI depends on the workload’s instruction mix -- how many of each type of instruction executes • What program is running? • How was it compiled?
  • 13. The Compiler’s Role • Compilers affect CPI… • Wise instruction selection • “Strength reduction”: x * 2^n -> x << n • Use registers to eliminate loads and stores • More compact code -> less waiting for instructions • …and instruction count • Common sub-expression elimination • Use registers to eliminate loads and stores
  • 14. Stupid Compiler
      int i, sum = 0;
      for (i = 0; i < 10; i++) sum += i;

      sw 0($sp), $0        # sum = 0
      sw 4($sp), $0        # i = 0
      loop:
      lw $1, 4($sp)        # load i
      sub $3, $1, 10
      beq $3, $0, end      # exit when i == 10
      lw $2, 0($sp)        # load sum
      add $2, $2, $1       # sum += i
      sw 0($sp), $2        # store sum
      addi $1, $1, 1       # i++
      sw 4($sp), $1        # store i
      b loop
      end:

      Type    CPI   Static #   Dyn #
      mem     5     6          42
      int     1     3          30
      br      1     2          20
      Total   2.8   11         92
      (5*42 + 1*30 + 1*20)/92 = 2.8
  • 15. Smart Compiler
      int i, sum = 0;
      for (i = 0; i < 10; i++) sum += i;

      add $1, $0, $0       # i = 0
      add $2, $0, $0       # sum = 0
      loop:
      sub $3, $1, 10
      beq $3, $0, end      # exit when i == 10
      add $2, $2, $1       # sum += i
      addi $1, $1, 1       # i++
      b loop
      end:
      sw 0($sp), $2        # store sum once, after the loop

      Type    CPI    Static #   Dyn #
      mem     5      1          1
      int     1      5          32
      br      1      2          20
      Total   1.08   8          53
      (5*1 + 1*32 + 1*20)/53 = 1.08
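The average CPI on the two compiler slides is a weighted mean over the dynamic instruction mix. The sketch below recomputes it from the per-type CPIs and dynamic counts listed on those slides; the function name `average_cpi` and the dictionary layout are assumptions made for illustration.

```python
def average_cpi(mix):
    """mix maps instruction type -> (cycles per instruction, dynamic count)."""
    total_cycles = sum(cpi * count for cpi, count in mix.values())
    total_instructions = sum(count for _, count in mix.values())
    return total_cycles / total_instructions

# Dynamic counts as listed on the "Stupid Compiler" and "Smart Compiler" slides.
stupid = {"mem": (5, 42), "int": (1, 30), "br": (1, 20)}
smart = {"mem": (5, 1), "int": (1, 32), "br": (1, 20)}

stupid_cpi = average_cpi(stupid)   # 260 cycles / 92 instructions
smart_cpi = average_cpi(smart)     # 57 cycles / 53 instructions
print(f"stupid: {stupid_cpi:.2f}  smart: {smart_cpi:.2f}")
```

For the smart compiler's counts this formula gives roughly 1.08. Keeping `i` and `sum` in registers lowers both the instruction count (92 to 53) and the CPI, so total cycles drop from 260 to 57.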
  • 17. Program inputs affect CPI too!
      int rand[1000] = { random 0s and 1s };
      for (i = 0; i < 1000; i++)
          if (rand[i]) sum -= i; else sum *= i;

      int ones[1000] = { 1, 1, ... };
      for (i = 0; i < 1000; i++)
          if (ones[i]) sum -= i; else sum *= i;
    • Data-dependent computation
    • Data-dependent micro-architectural behavior
      – Processors are faster when the computation is predictable (more later)
  • 19. Making Meaningful Comparisons • Meaningful CPI exists only: • For a particular program with a particular compiler • ...with a particular input. • You MUST consider all 3 to get accurate latency estimations or machine speed comparisons • Instruction set • Compiler • Implementation of the instruction set (386 vs Pentium) • Processor frequency (600 MHz vs 1 GHz) • Same high-level program with the same input • “Wall clock” measurements are always comparable • If the workloads (app + inputs) are the same
  • 20. The Performance Equation • Clock rate = • Instruction count = • Latency = • Find the CPI!
  • 21. Today • DRAM • Quiz 1 recap • HW 1 recap • Questions about ISAs • More about the project? • Amdahl’s law
  • 22. Key Points • Amdahl’s law and how to apply it in a variety of situations • Its role in guiding optimization of a system • Its role in determining the impact of localized changes on the entire system
  • 23. Limits on Speedup: Amdahl’s Law • “The fundamental theorem of performance optimization” • Coined by Gene Amdahl (one of the designers of the IBM 360) • Optimizations do not (generally) uniformly affect the entire program – The more widely applicable a technique is, the more valuable it is – Conversely, limited applicability can (drastically) reduce the impact of an optimization. • Always heed Amdahl’s Law!!! It is central to many optimization problems.
  • 24. Amdahl’s Law in Action • SuperJPEG-O-Rama2000 ISA extensions ** –Speeds up JPEG decode by 10x!!! –Act now! While Supplies Last! ** Increases processor cost by 45%
  • 28. Amdahl’s Law in Action • SuperJPEG-O-Rama2000 in the wild • PictoBench spends 33% of its time doing JPEG decode • How much does JOR2k help? • [Figure: execution time with the JPEG decode portion highlighted: 30s w/o JOR2k, 21s w/ JOR2k] • Performance: 30/21 = 1.4x • Speedup != 10x • Is this worth the 45% increase in cost? • Amdahl ate our Speedup!
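The PictoBench numbers can be reproduced with Amdahl's Law. This is a sketch under the slide's stated figures (33% of a 30 s run, 10x faster JPEG decode, 45% higher cost); the performance-per-cost ratio at the end is an added illustration that assumes cost is the only other factor being weighed.

```python
def amdahl_speedup(x, s):
    """Overall speedup when a fraction x of the run time is sped up by s."""
    return 1.0 / (x / s + (1.0 - x))

base_time_s = 30.0      # PictoBench run time without JOR2k
jpeg_fraction = 0.33    # share of time spent in JPEG decode
jpeg_speedup = 10.0     # advertised JOR2k speedup for JPEG decode
cost_increase = 1.45    # processor cost relative to the baseline

total_speedup = amdahl_speedup(jpeg_fraction, jpeg_speedup)
new_time_s = base_time_s / total_speedup
perf_per_cost = total_speedup / cost_increase

print(f"new time: {new_time_s:.1f} s, speedup: {total_speedup:.2f}x")
print(f"performance per unit cost vs. baseline: {perf_per_cost:.2f}")
```

The overall speedup is about 1.42x, and performance per unit cost comes out slightly below 1, which is one way to frame the slide's question about the 45% price increase.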
  • 30. Amdahl’s Law • The second fundamental theorem of computer architecture. • If we can speed up a fraction x of the program by S times • Amdahl’s Law gives the total speedup, Stot: Stot = 1 / (x/S + (1 - x)) • Sanity check: x = 1 => Stot = 1 / (1/S + (1 - 1)) = 1 / (1/S) = S
  • 31. Amdahl’s Corollary #1 • Maximum possible speedup, Smax, as S -> infinity: Smax = 1 / (1 - x)
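A short sketch of Amdahl's Law and Corollary #1 as functions, with the sanity check from the slides. The names `amdahl_speedup` and `max_speedup` are illustrative, and the value x = 0.33 is borrowed from the PictoBench example.

```python
import math

def amdahl_speedup(x, s):
    """Stot = 1 / (x/S + (1 - x)) for a fraction x sped up by a factor s."""
    return 1.0 / (x / s + (1.0 - x))

def max_speedup(x):
    """Smax = 1 / (1 - x): the limit of Stot as s grows without bound (x < 1)."""
    return 1.0 / (1.0 - x)

# Sanity check: if the whole program is sped up (x = 1), Stot equals S.
whole_program = amdahl_speedup(1.0, 7.0)

# A huge S on 33% of the program approaches, but never exceeds, Smax.
nearly_infinite = amdahl_speedup(0.33, 1e12)
limit = max_speedup(0.33)
print(f"{whole_program:.2f} {nearly_infinite:.4f} {limit:.4f}")
```

Even an effectively infinite speedup of one third of the program yields less than 1.5x overall.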
  • 32. Amdahl’s Law Practice • Protein String Matching Code –200 hours to run on current machine, spends 20% of time doing integer instructions –How much faster must you make the integer unit to make the code run 10 hours faster? –How much faster must you make the integer unit to make the code run 50 hours faster? A)1.1 B)1.25 C)1.75 D)1.33 E) 10.0 F) 50.0 G) 1 million times H) Other
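One way to approach this practice problem is to solve Amdahl's Law for S given a target time saving. Setting T * (x/S + (1 - x)) = T - saved gives S = x / (x - saved/T), which has a finite solution only when the saving is smaller than the time spent in the optimized part. The helper name `required_speedup` is hypothetical.

```python
import math

def required_speedup(total_hours, fraction, hours_saved):
    """Speedup needed on `fraction` of the run to save `hours_saved` hours.

    Returns None when no finite speedup can achieve the saving.
    """
    remaining = fraction - hours_saved / total_hours
    if remaining <= 0:
        return None   # would need the optimized part to take zero or negative time
    return fraction / remaining

save_10 = required_speedup(200, 0.20, 10)   # integer work: 40 h -> 30 h
save_50 = required_speedup(200, 0.20, 50)   # more than the 40 h available
print(save_10, save_50)
```

Saving 10 hours needs the integer unit to be about 1.33x faster. Saving 50 hours is out of reach because integer instructions account for only 40 of the 200 hours.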
  • 33. Amdahl’s Law Practice • Protein String Matching Code –4 days execution time (ET) on current machine • 20% of time doing integer instructions • 35% of time doing I/O –Which is the better economic tradeoff? • Compiler optimization that reduces number of integer instructions by 25% (assume each integer inst takes the same amount of time) • Hardware optimization that makes I/O run 20% faster?
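A sketch comparing the two options on run time alone. It assumes "20% faster" means a speedup factor of 1.2 on the I/O portion, and that removing 25% of the integer instructions scales integer time by 0.75. The slide gives no prices, so the economic side of the question is left open here.

```python
import math

def scaled_time(total, fraction, time_scale):
    """New total time when `fraction` of the run has its time multiplied by `time_scale`."""
    return total * (fraction * time_scale + (1.0 - fraction))

total_days = 4.0

# Option 1: compiler removes 25% of integer instructions (20% of run time).
compiler_days = scaled_time(total_days, 0.20, 0.75)

# Option 2: hardware makes I/O (35% of run time) 1.2x faster (assumed meaning of "20% faster").
io_days = scaled_time(total_days, 0.35, 1.0 / 1.2)

print(f"compiler: {compiler_days:.3f} days, I/O hardware: {io_days:.3f} days")
```

Under these assumptions the I/O change saves a little more time (about 3.77 days versus 3.80), so the better tradeoff depends on how the hardware cost compares with the compiler work.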
  • 34. Amdahl’s Law Applies All Over • SSDs use 10x less power than HDs • But they only save you ~50% overall.
  • 35. Amdahl’s Law in Memory • [Figure: memory device with a row decoder (high-order address bits), column decoder (low-order address bits), sense amps, storage array, and data/address connections] • Storage array 90% of area • Row decoder 4% • Column decoder 2% • Sense amps 4% • What’s the benefit of reducing bit size by 10%? • Reducing column decoder size by 90%?
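The same reasoning applies to chip area instead of time. The sketch below assumes that shrinking the bit size by 10% shrinks the storage array area by 10%; the function name `scaled_area` is illustrative.

```python
import math

def scaled_area(fraction, area_scale):
    """Relative total area when `fraction` of the chip is multiplied by `area_scale`."""
    return fraction * area_scale + (1.0 - fraction)

# Storage array is 90% of the area; bits get 10% smaller (assumed to scale the array by 0.9).
smaller_bits = scaled_area(0.90, 0.90)

# Column decoder is 2% of the area; it shrinks by 90%.
smaller_decoder = scaled_area(0.02, 0.10)

print(f"10% smaller bits:      {1 - smaller_bits:.1%} area saved")
print(f"90% smaller col. dec.: {1 - smaller_decoder:.1%} area saved")
```

A modest improvement to the dominant component (about 9% total area saved) beats a dramatic improvement to a small one (under 2%).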
  • 36. Amdahl’s Corollary #2 • Make the common case fast (i.e., x should be large)! –Common == “most time consuming” not necessarily “most frequent” –The uncommon case doesn’t make much difference –Be sure of what the common case is –The common case changes. • Repeat… –With optimization, the common becomes uncommon and vice versa.
  • 41. Amdahl’s Corollary #2: Example • [Figure: execution time after successive optimizations of the current common case] • 7x => 1.4x • 4x => 1.3x • 1.3x => 1.1x • Total = 20/10 = 2x • In the end, there is no common case! • Options: – Global optimizations (faster clock, better compiler) – Find something common to work on (e.g., memory latency) – War of attrition – Total redesign (You are probably well-prepared for this)
  • 44. Amdahl’s Corollary #3 • Benefits of parallel processing • p processors • A fraction x is p-way parallelizable • Maximum speedup, Spar: Spar = 1 / (x/p + (1 - x)) • x is pretty small for desktop applications, even for p = 2 • Does Intel’s 80-core processor make much sense?
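To get a feel for Corollary #3, the sketch below evaluates Spar for a few parallel fractions. The values of x are hypothetical, chosen only to illustrate the shape of the curve; the slides do not give measured fractions.

```python
import math

def parallel_speedup(x, p):
    """Spar = 1 / (x/p + (1 - x)) for a fraction x that is p-way parallelizable."""
    return 1.0 / (x / p + (1.0 - x))

# Hypothetical parallel fractions, evaluated on 2 and 80 processors.
for x in (0.5, 0.9, 0.99):
    print(f"x = {x:.2f}: p=2 -> {parallel_speedup(x, 2):.2f}x, "
          f"p=80 -> {parallel_speedup(x, 80):.2f}x")

half_parallel_80 = parallel_speedup(0.5, 80)
```

With only half the work parallelizable, 80 cores deliver just under 2x, which is the point behind the slide's question.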
  • 45. Amdahl’s Corollary #4 • Amdahl’s law for latency (L): Lnew = Lbase * 1/Speedup; Lnew = Lbase * (x/S + (1 - x)); Lnew = (Lbase/S)*x + Lbase*(1 - x) • If you can speed up a fraction y of the remaining (1 - x), you can apply Amdahl’s law recursively: Lnew = (Lbase/S1)*x + (Lbase*(1 - x)/S2)*y + Lbase*(1 - x)*(1 - y) • This is how we will analyze memory system performance
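A sketch of the recursive latency form, written with every term scaled by Lbase. The numbers in the example (Lbase = 100, x = 0.4, S1 = 2, y = 0.5, S2 = 4) are hypothetical.

```python
import math

def new_latency(l_base, x, s1, y=0.0, s2=1.0):
    """Latency after speeding up fraction x by s1 and fraction y of the rest by s2."""
    first = (l_base / s1) * x                      # part sped up by s1
    second = (l_base * (1.0 - x) / s2) * y         # part of the remainder sped up by s2
    untouched = l_base * (1.0 - x) * (1.0 - y)     # everything else
    return first + second + untouched

two_level = new_latency(100.0, x=0.4, s1=2.0, y=0.5, s2=4.0)
one_level = new_latency(100.0, x=0.4, s1=2.0)
print(two_level, one_level)
```

With y = 0 the expression collapses to the single-level form Lbase * (x/S + (1 - x)), which is a quick consistency check on the recursive version.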
  • 46. Amdahl’s Non-Corollary • Amdahl’s law does not bound slowdown: Lnew = (Lbase/S)*x + Lbase*(1 - x) • Lnew is linear in 1/S • Example: x = 0.01 of execution, Lbase = 1 –S = 0.001: • Lnew = 1000*Lbase*0.01 + Lbase*(0.99) ~ 10*Lbase –S = 0.00001: • Lnew = 100000*Lbase*0.01 + Lbase*(0.99) ~ 1000*Lbase • Things can only get so fast, but they can get arbitrarily slow. –Do not hurt the non-common case too much!
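The slowdown example can be checked numerically. This sketch reuses the latency form of Amdahl's Law with S < 1, meaning the affected 1% of execution gets slower rather than faster; the function name is illustrative.

```python
import math

def new_latency(l_base, x, s):
    """Lnew = (Lbase/S)*x + Lbase*(1 - x); s < 1 models a slowdown."""
    return (l_base / s) * x + l_base * (1.0 - x)

slow_1000x = new_latency(1.0, 0.01, 0.001)      # 1% of the run becomes 1000x slower
slow_100000x = new_latency(1.0, 0.01, 0.00001)  # 1% of the run becomes 100000x slower
print(f"{slow_1000x:.2f} {slow_100000x:.2f}")
```

The results are roughly 11x and 1001x the baseline latency, in line with the slide's approximations of ~10x and ~1000x.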
  • 47. Benchmarks: Standard Candles for Performance • It’s hard to convince manufacturers to run your program (unless you’re a BIG customer) • A benchmark is a set of programs that are representative of a class of problems. • To increase predictability, collections of benchmark applications, called benchmark suites, are popular – “Easy” to set up – Portable – Well-understood – Stand-alone – Standardized conditions – These are all things that real software is not.
  • 48. Classes of benchmarks • Microbenchmark – measure one feature of system – e.g. memory accesses or communication speed • Kernels – most compute-intensive part of applications – e.g. Linpack and NAS kernel benchmarks (for supercomputers) • Full application: – SPECint / SPECfp (int and float) (for Unix workstations) – Other suites for databases, web servers, graphics,...
  • 49. Bandwidth • The amount of work (or data) per time • MB/s, GB/s -- network BW, disk BW, etc. • Frames per second -- Games, video transcoding • (why are games under both latency and BW?) • Also called “throughput”
  • 50. Measuring Bandwidth • Measure how much work is done • Measure latency • Divide
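The three steps on this slide map onto a few lines of code. In the sketch below the "work" is copying a 16 MiB buffer in memory, which is a hypothetical workload chosen for illustration, and `measure_bandwidth` is an assumed helper name.

```python
import time

def measure_bandwidth(work_fn, units_of_work):
    """Bandwidth = work done / latency, in units of work per second."""
    start = time.perf_counter()
    work_fn()                                   # do the work
    latency = time.perf_counter() - start       # measure latency
    return units_of_work / latency              # divide

buffer = bytearray(16 * 1024 * 1024)            # 16 MiB of zeros

def copy_buffer():
    return bytes(buffer)                        # copies every byte once

bytes_per_second = measure_bandwidth(copy_buffer, len(buffer))
print(f"{bytes_per_second / 1e9:.2f} GB/s")
```

The result varies from run to run and from machine to machine, so a real measurement would repeat the copy several times.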
  • 51. Latency-BW Trade-offs • Often, increasing latency for one task can increase BW for many tasks. • Think of waiting in line for one of 4 bank tellers • If the line is empty, your response time is minimized, but throughput is low because utilization is low. • If there is always a line, you wait longer (your latency goes up), but there is always work available for tellers. • Much of computer performance is about scheduling work onto resources • Network links. • Memory ports. • Processors, functional units, etc. • IO channels. • Increasing contention for these resources generally increases throughput but hurts latency.
  • 52. Station Wagon Digression • IPv6 Internet 2: 272,400 terabit-meters per second –585 GB in 30 minutes over 30,000 km –9.08 Gb/s • Subaru Outback wagon – Max load = 408 kg – 21 mpg • MHX2 BT 300 laptop drive – 300 GB/drive – 0.135 kg • Payload: 906 TB • Legal speed: 75 mph (33.3 m/s) • BW = 8.2 Gb/s • Latency = 10 days • 241,535 terabit-meters per second
  • 53. Prius Digression • IPv6 Internet 2: 272,400 terabit-meters per second –585 GB in 30 minutes over 30,000 km –9.08 Gb/s • My Toyota Prius – Max load = 374 kg – 44 mpg (2x power efficiency) • MHX2 BT 300 laptop drive – 300 GB/drive – 0.135 kg • Payload: 831 TB • Legal speed: 75 mph (33.3 m/s) • BW = 7.5 Gb/s • Latency = 10 days • 221,407 terabit-meters per second (13% performance hit)
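The car figures can be approximated from the load, drive, distance, and speed values on the two digression slides. The sketch assumes decimal units (1 GB = 10^9 bytes), a fractional number of drives, and a constant 33.3 m/s over the full 30,000 km; the function name `sneakernet` is hypothetical.

```python
def sneakernet(max_load_kg, drive_kg, drive_gb, distance_m, speed_mps):
    """Return (payload bits, latency in seconds, bandwidth in bits/s) for a car full of drives."""
    drives = max_load_kg / drive_kg               # fractional drives, for simplicity
    payload_bits = drives * drive_gb * 1e9 * 8    # decimal gigabytes -> bits
    latency_s = distance_m / speed_mps
    return payload_bits, latency_s, payload_bits / latency_s

DISTANCE_M = 30_000_000   # 30,000 km
SPEED_MPS = 33.3          # 75 mph

_, wagon_latency_s, wagon_bps = sneakernet(408, 0.135, 300, DISTANCE_M, SPEED_MPS)
_, _, prius_bps = sneakernet(374, 0.135, 300, DISTANCE_M, SPEED_MPS)

latency_days = wagon_latency_s / 86_400
print(f"wagon: {wagon_bps / 1e9:.2f} Gb/s, prius: {prius_bps / 1e9:.2f} Gb/s, "
      f"latency: {latency_days:.1f} days")
```

Under these assumptions the bandwidths come out near 8.05 Gb/s and 7.38 Gb/s, slightly below the 8.2 and 7.5 Gb/s quoted on the slides but consistent with their terabit-meter figures divided by the 30,000 km distance. Either way, the car's bandwidth is comparable to the network's while its latency is measured in days.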