SlideShare a Scribd company logo
1 of 47
Download to read offline
1
Computer Architecture
Performance and Energy
Consumption
Computer Architecture
CA: Performance and Power
2
Defining Performance
n Which airplane has the best performance?
CA: Performance and Power
3
Response Time and Throughput
n Response time
n How long it takes to do a task
n Throughput
n Total work done per unit time
n e.g., tasks/transactions/… per hour
n How are response time and throughput affected
by
n Replacing the processor with a faster version?
n Adding more processors?
n We’ll focus on response time for now…
CA: Performance and Power
4
Relative Performance
n Define Performance = 1/Execution Time
n “X is n time faster than Y”
n
=
= X
Y
Y
X
time
Execution
time
Execution
e
Performanc
e
Performanc
n Example: time taken to run a program
n 10s on A, 15s on B
n Execution TimeB / Execution TimeA
= 15s / 10s = 1.5
n So A is 1.5 times faster than B
CA: Performance and Power
5
Measuring Execution Time
n Elapsed time
n Total response time, including all aspects
n Processing, I/O, OS overhead, idle time
n Determines system performance
n CPU time
n Time spent processing a given job
n Discounts I/O time, other jobs’ shares
n Comprises user CPU time and system CPU
time
n Different programs are affected differently by
CPU and system performance
CA: Performance and Power
6
Analyze the Right Measurement!
CA: Performance and Power
Guides CPU design
Guides system design
CPU Time:
Measuring CPU time
(Ubuntu):
$ time <program name>
real 0m0.095s
user 0m0.013s
sys 0m0.008s
Real elapsed
time (in minutes
and seconds) Time the CPU
spends running
program under
measurement
Total response time =
CPU time + time spent
waiting (for disk, I/O, ..)
7
CPU Clocking
n Operation of digital hardware governed by a
constant-rate clock
Clock (cycles)
Data transfer
and computation
Update state
Clock period
n Clock period: duration of a clock cycle
n e.g., 250ps = 0.25ns = 250×10–12 s
n Clock frequency (rate): cycles per second
n e.g., 4.0GHz = 4000MHz = 4.0×109 Hz
CA: Performance and Power
8
CPU Time
n Performance improved by
n Reducing number of clock cycles
n Increasing clock rate
n Hardware designer must often trade off clock
rate against cycle count
Rate
Clock
Cycles
Clock
CPU
Time
Cycle
Clock
Cycles
Clock
CPU
Time
CPU
=
´
=
CA: Performance and Power
9
CPU Time Example
n Computer A: 2GHz clock, 10s CPU time
n Designing Computer B
n Aim for 6s CPU time
n Can do faster clock, but causes 1.2 × clock cycles
n How fast must Computer B clock be?
4GHz
6s
10
24
6s
10
20
1.2
Rate
Clock
10
20
2GHz
10s
Rate
Clock
Time
CPU
Cycles
Clock
6s
Cycles
Clock
1.2
Time
CPU
Cycles
Clock
Rate
Clock
9
9
B
9
A
A
A
A
B
B
B
=
´
=
´
´
=
´
=
´
=
´
=
´
=
=
CA: Performance and Power
10
Instruction Count and CPI
n Instruction Count (IC) for a program
n Determined by program, ISA and compiler
n Average cycles per instruction (CPI)
n Depends on program, CPU hardware and compiler
n If different instructions have different CPI
n Average CPI affected by instruction mix
Rate
Clock
CPI
Count
n
Instructio
Time
Cycle
Clock
CPI
Count
n
Instructio
Time
CPU
n
Instructio
per
Cycles
Count
n
Instructio
Cycles
Clock
´
=
´
´
=
´
=
CA: Performance and Power
11
CPI Example
n Computer A: Cycle Time = 250ps, CPI = 2.0
n Computer B: Cycle Time = 500ps, CPI = 1.2
n Same ISA
n Which is faster, and by how much?
1.2
500ps
I
600ps
I
A
Time
CPU
B
Time
CPU
600ps
I
500ps
1.2
I
B
Time
Cycle
B
CPI
Count
n
Instructio
B
Time
CPU
500ps
I
250ps
2.0
I
A
Time
Cycle
A
CPI
Count
n
Instructio
A
Time
CPU
=
´
´
=
´
=
´
´
=
´
´
=
´
=
´
´
=
´
´
=
A is faster…
…by this much
CA: Performance and Power
12
How Calculate the 3 Components?
n Clock Cycle Time: in specification of computer
(Clock Rate in advertisements)
n Instruction Count:
n Count instructions in loop of small program
n Use simulator to count instructions
n Hardware counter in spec. register (most CPUs)
°CPI:
• Calculate: Execution Time / Clock cycle time
Instruction Count
• Hardware counter in special register (most CPUs)
CA: Performance and Power
13
CPI in More Detail
n If different instruction classes take different
numbers of cycles
ĂĽ
=
´
=
n
1
i
i
i )
Count
n
Instructio
(CPI
Cycles
Clock
n Weighted average CPI
ĂĽ
=
á
ø
Ăś
ç
è
ĂŚ
´
=
=
n
1
i
i
i
Count
n
Instructio
Count
n
Instructio
CPI
Count
n
Instructio
Cycles
Clock
CPI
Relative frequency
CA: Performance and Power
14
Calculating Average CPI
n First read CPIi for each individual instruction
(add, sub, and, etc.)
n Next use when it’s given or calculate relative
frequency fi of each individual instruction
n Finally multiply these two for each
instruction and add them up to get final CPI
𝐶𝑃𝐼 = %
!"#
$
𝑓𝑖 ∗ 𝐶𝑃𝐼𝑖
𝑓𝑖 = 𝐼𝐶𝑖/𝐼𝐶
CA: Performance and Power
15
Example with Bonus Points to Earn!
Op Freqi CPIi Prod (% Time)
ALU 50% 1 .5 (33%)
Load 20% 2 .4 (27%)
Store 15% 2 .3 (20%)
Branch 15% 2 .3 (20%)
1.5
• What if you can make branch instructions twice as fast (CPIbr= 1 cycle)
but clock rate (CR) will decrease by 12%? Will it be a speedup or slowdown
and how much?
Instruction Mix (Where time spent)
CA: Performance and Power
New CPI = 1.35
Time before change = IC*1.5 / CR = IC*1.5/CR
Time after change = IC*1.35 / (0.88*CR) = IC*1.534/CR
Time before/Time after = IC*1.5*CR / IC*1.534*CR = 0.978 => 2.2% slowdown
Speedup < 1 because the time after is greater than the time before the change
16
Another CPI Example
n Alternative compiled code sequences using
instructions in classes A, B, C
Class A B C
CPI for class 1 2 3
IC in sequence 1 2 1 2
IC in sequence 2 4 1 1
n Sequence 1: IC = 5
n Clock Cycles
= 2×1 + 1×2 + 2×3
= 10
n Avg. CPI = 10/5 = 2.0
n Sequence 2: IC = 6
n Clock Cycles
= 4×1 + 1×2 + 1×3
= 9
n Avg. CPI = 9/6 = 1.5
CA: Performance and Power
17
CPU Performance Law
n Performance depends on
n Algorithm: affects IC, possibly CPI
n Programming language: affects IC, CPI
n Compiler: affects IC, CPI
n Instruction set architecture: affects IC, CPI, Tc
n Processor microarchitecture (organization): affects CPI, Tc
n Technology: affects Tc
The BIG Picture
Rate
Clock
CPI
Count
n
Instructio
Time
Cycle
Clock
CPI
Count
n
Instructio
Time
CPU
n
Instructio
per
Cycles
Count
n
Instructio
Cycles
Clock
´
=
´
´
=
´
= cycle
Clock
Seconds
n
Instructio
cycles
Clock
Program
ns
Instructio
Time
CPU ´
´
=
CA: Performance and Power
18
What Programs Measure for Comparison?
n Ideally run typical programs with typical input
before purchase,
or before even build machine
n Called a “workload”; For example:
n Engineer uses compiler, spreadsheet
n Author uses word processor, drawing program,
compression software
n In some situations it’s hard to do
n Don’t have access to machine to “benchmark”
before purchase
n Don’t know workload in future
CA: Performance and Power
19
SPEC CPU Benchmarks
n Standard Performance Evaluation Corporation
(SPEC) www.spec.org
n Elapsed time to execute a selection of programs
n Negligible I/O, so focuses on CPU performance
n SPEC95: 8 integer (gcc, compress, li, ijpeg, perl, ...) & 10
floating-point (FP) programs (hydro2d, mgrid, applu, turbo3d, ...)
n SPEC2000: 11 integer (gcc, bzip2, …), 18 FP (mgrid, swim,
ma3d, …)
n Separate average for integer and FP
n Benchmarks distributed in source code
n Compiler, machine designers target benchmarks, so try to
change every 3 years
CA: Performance and Power
20
How Summarize Suite Performance (1/4)
n Arithmetic average of execution time of all programs?
n But they vary by 4X in speed, so some would be
more important than others in arithmetic average
n Could add a weights per program, but how pick
weight?
n Different companies want different weights for their
products
n SPECRatio: Normalize execution times to reference
computer, yielding a ratio proportional to performance
=
time on reference computer
time on computer being rated
CA: Performance and Power
21
How Summarize Suite Performance (2/4)
n If program SPECRatio on Computer A is
1.25 times bigger than Computer B, then
B
A
A
B
B
reference
A
reference
B
A
e
Performanc
e
Performanc
ime
ExecutionT
ime
ExecutionT
ime
ExecutionT
ime
ExecutionT
ime
ExecutionT
ime
ExecutionT
SPECRatio
SPECRatio
=
=
=
=
25
.
1
• Note that when comparing 2 computers as a ratio, execution
times on the reference computer drop out, so choice of reference
computer is irrelevant
CA: Performance and Power
22
How Summarize Suite Performance (3/4)
n Since ratios, proper mean is geometric mean
(SPECRatio unitless, so arithmetic mean meaningless)
n
n
i
i
SPECRatio
ean
GeometricM Õ
=
=
1
1. Geometric mean of the ratios is the same as the ratio of the
geometric means
2. Ratio of geometric means
= Geometric mean of performance ratios
Þ choice of reference computer is irrelevant!
• These two points make geometric mean of ratios attractive to
summarize performance
CA: Performance and Power
23
SPECspeed 2017 Integer Benchmarks on a
1.8 GHz Intel Xeon E5-2650L
CA: Performance and Power
24
Analysis: CT involves breaking down complex ideas, arguments, or situations into smaller components to understand their
structure and relationships.
Evaluation: It entails assessing the validity, reliability, and credibility of information, sources, and arguments. This involves
considering evidence, biases, and potential shortcomings.
Interpretation: CT involves interpreting information and data to derive meaning, identify patterns, and draw conclusions.
Inference: It includes making reasonable and well-supported conclusions based on available evidence and information.
Problem Solving: CT aids in solving problems by approaching them systematically, considering different perspectives, and
evaluating potential solutions.
Contextualization: It requires considering the broader context and implications of information, ideas, and decisions.
Communication: CT involves effectively conveying ideas, arguments, and analyses to others and engaging in constructive
discussions.
Skepticism: Critical thinkers approach information with a healthy dose of skepticism,
questioning assumptions and seeking evidence.
Open-Mindedness: It involves being open to different viewpoints, considering alternative explanations, and being willing to
change one's views in light of new evidence.
Reflection: It includes reflecting on one's own thinking processes, biases, & assumptions.
Critical thinking (CT) involves several key elements (in the
words of AI itself):
What’s at stake for you here
Is a single mean a good predictor of the performance of programs in
benchmark suite?
How much confidence you can have when using it to compare different
processors or to predict future performance on similar apps?
It's tough to make predictions, especially about the future.
Yogi Berra
25
How Summarize Suite Performance (4/4)
n Does a single mean well summarize performance of programs in
benchmark suite?
n Can decide if mean a good predictor by characterizing variability
of distribution using standard deviation
n Like geometric mean, geometric standard deviation is multiplicative
rather than arithmetic
n Can simply take the logarithm of SPECRatios, compute the standard
mean and standard deviation, and then take the exponent to convert
back:
n The geometric standard deviation, denoted by σg, is calculated as
follows: log σg=[1/n∑n
i=1(logxi−logG)2]1/2.
n where G=n√x1⋅x2⋅…⋅xn is the geometric mean of SPECRatios (x1 . xn).
( )
( )
( )
( )
i
n
i
i
SPECRatio
StDev
tDev
GeometricS
SPECRatio
n
ean
GeometricM
ln
exp
ln
1
exp
1
=
á
ø
Ăś
ç
è
ĂŚ
´
= ĂĽ
=
CA: Performance and Power
26
Example Standard Deviation: (1/3)
0
2000
4000
6000
8000
10000
12000
14000
wupwise
swim
mgrid
applu
mesa
galgel
art
equake
facerec
ammp
lucas
fma3d
sixtrack
apsi
SPECfpRatio
1372
5362
2712
GM = 2712
GStDev = 1.98
• GM and multiplicative StDev of SPECfp2000 for Itanium 2
Outside 1 StDev
Itanium 2 is
2712/100 times
as fast as Sun
Ultra 5 (GM), &
range within 1
Std. Deviation is
[13.72, 53.62]
CA: Performance and Power
27
Example Standard Deviation : (2/3)
• GM and multiplicative StDev of SPECfp2000 for AMD Athlon
0
2000
4000
6000
8000
10000
12000
14000
wupwise
swim
mgrid
applu
mesa
galgel
art
equake
facerec
ammp
lucas
fma3d
sixtrack
apsi
SPECfpRatio
1494
2911
2086
GM = 2086
GStDev = 1.40
Outside 1 StDev
Athon is
2086/100 times
as fast as Sun
Ultra 5 (GM), &
range within 1
Std. Deviation is
[14.94, 29.11]
CA: Performance and Power
28
Example Standard Deviation (3/3)
-
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4.00
4.50
5.00
wupwise
swim
mgrid
applu
mesa
galgel
art
equake
facerec
ammp
lucas
fma3d
sixtrack
apsi
Ratio
Itanium
2
v.
Athlon
for
SPECfp2000
0.75
2.27
1.30
GM = 1.30
GStDev = 1.74
• GM and StDev Itanium 2 v Athlon
Outside 1 StDev
Ratio execution times (At/It) =
Ratio of SPECratios (It/At)
Itanium 2 1.30X Athlon (GM),
1 St.Dev. Range [0.75,2.27]
CA: Performance and Power
29
Comments on Itanium 2 and Athlon
n Standard deviation of 1.98 for Itanium 2 is much
higher-- vs. 1.40--so results will differ more
widely from the mean, and therefore are likely
less predictable for Itanium 2
n Falling within one standard deviation:
n 10 of 14 benchmarks (71%) for Itanium 2
n 11 of 14 benchmarks (78%) for Athlon
n Thus, the results are quite compatible with a
lognormal distribution (expect 68%)
n Itanium 2 vs. Athlon St.Dev is 1.74, which is high,
so less confidence in claim that Itanium 1.30
times as fast as Athlon
n Indeed, Athlon faster on 6 of 14 programs
CA: Performance and Power
30
Amdahl’s Law
( )
enhanced
enhanced
enhanced
Enh.
w/
Enh.
w/o
overall
Speedup
Fraction
Fraction
1
1
ExTime
ExTime
Speedup
+
-
=
=
Best you could ever hope to do:
( )
enhanced
maximum
Fraction
-
1
1
Speedup =
( ) Ăş
Ăť
Ăš
ĂŞ
ĂŤ
ĂŠ
+
-
´
=
enhanced
enhanced
enhanced
Enh.
w/o
Enh.
w/
Speedup
Fraction
Fraction
1
ExTime
ExTime
F
CA: Performance and Power
31
Amdahl’s Law Example
n New CPU 10X faster
n I/O bound server, so 60% time waiting for I/O
( )
( )
56
.
1
64
.
0
1
10
0.4
0.4
1
1
Speedup
Fraction
Fraction
1
1
Speedup
enhanced
enhanced
enhanced
overall
=
=
+
-
=
+
-
=
• Apparently, its human nature to be attracted by 10X
faster, vs. keeping in perspective its just 1.6X faster
CA: Performance and Power
32
Question
n Speedup = 1
(1 - F) + F
S
Question: Suppose a program spends 80% of its
time in a square root routine. How much must you
speed up square root to make the program run 5
times faster?
10
(A)
20
(B)
None of the above
(C) 100
(D)
CA: Performance and Power
33
Consequence of Amdahl’s Law
n The amount of speedup that can be achieved
through parallelism is limited by the non-parallel
portion of your program!
Speedup
Number of Processors
Parallel
portion
Serial
portion
Time
Number of Processors
1 2 3 4 5
CA: Performance and Power
34
Parallel Speed-up Examples (1/2)
n 10 “scalar” operations (non-parallelizable)
n 100 parallelizable operations
n Say, element-wise addition of two 10x10 matrices.
n 110 operations
n 100/110 = .909 Parallelizable, 10/110 = 0.091 Scalar
Z1 + Z2 + … + Z10 X1,1 X1,10
X10,1 X10,10
Y1,1 Y1,10
Y10,1 Y10,10
+
Non-parallel part Parallel part
Partition 10 ways
and perform
on 10 parallel
processing units
.
.
.
.
.
.
CA: Performance and Power
35
Parallel Speed-up Examples (2/2)
n Consider summing 10 scalar variables and two
10 by 10 matrices (matrix sum) on 10
processors
Speedup = 1/(.091 + .909/10) = 1/0.1819 = 5.5
n What if there are 100 processors ?
Speedup = 1/(.091 + .909/100) = 1/0.10009 = 10.0
n What if the matrices are 100 by 100 (or 10,010
adds in total) on 10 processors?
Speedup = 1/(.001 + .999/10) = 1/0.1009 = 9.9
n What if there are 100 processors ?
Speedup = 1/(.001 + .999/100) = 1/0.01099 = 91
Speedup w/ E = 1 / [ (1-F) + F/S ]
CA: Performance and Power
36
Strong and Weak Scaling
n To get good speedup on a multiprocessor while keeping
the problem size fixed is harder than getting good
speedup by increasing the size of the problem
n Strong scaling: When speedup is achieved on a
parallel processor without increasing the size of the
problem
n Weak scaling: When speedup is achieved on a
parallel processor by increasing the size of the
problem proportionally to the increase in the number
of processors (Gustafson's law)
n Load balancing is another important factor: every
processor doing same amount of work
n Just 1 unit with twice the load of others cuts speedup
almost in half (bottleneck!)
CA: Performance and Power
37
Other Performance Metrics
n MIPS – Million Instructions Per Second
• MFLOPS - Million Floating-point Operations
Per Second
6
10
)
(
_
´
=
s
Time
count
n
Instructio
MIPS
6
10
)
(
/
_
_
´
=
s
Time
program
ops
point
Floating
MFLOPS
• PetaFLOPS - 1015 Floating-point Operations
Per Second
15
10
)
(
/
_
_
´
=
s
Time
program
ops
point
Floating
PFLOPS
CA: Performance and Power
38
Pitfall: MIPS as a Performance Metric
n MIPS: Millions of Instructions Per Second
n Doesn’t account for
n Differences in ISAs between computers
n Differences in complexity between instructions
6
6
6
10
CPI
rate
Clock
10
rate
Clock
CPI
count
n
Instructio
count
n
Instructio
10
time
Execution
count
n
Instructio
MIPS
´
=
´
´
=
´
=
n CPI varies between programs on a given CPU
CA: Performance and Power
39
Uniprocessor Performance
Constrained by power, instruction-level parallelism,
memory latency
CA: Performance and Power
40
Multiprocessors
n Multicore microprocessors
n More than one processor per chip
n Requires explicitly parallel programming
n Compare with instruction level parallelism
n Hardware executes multiple instructions at once
n Hidden from the programmer
n Hard to do
n Programming for performance
n Load balancing
n Optimizing communication and synchronization
CA: Performance and Power
41
Supercomputing
▪ Today: clusters of multi-core CPUs + GPUs
▪ Frontier: The Exascale-class HPE Cray EX Supercomputer at Oak Ridge
National Laboratory (the fastest supercomputer in the world in 2023)
▪ 9,472 AMD Epyc 7453s "Trento" 64 core 2 GHz CPUs (606,208 cores) and
37,888 Instinct MI250X GPUs (8,335,360 cores).
▪ the most efficient supercomputer: 62.68 gigaflops/watt.
Space 680 m
2
(7,300 sq ft)
Speed 1.194 exaFLOPS (Rmax) /
1.67982 exaFLOPS (Rpeak)
Cost US$600 million (est. cost)
Purpose Scientific research and
development
CA: Performance and Power
42
Top 5 Supercomputers (TOP500, June 2023)
RankSystem Cores
Rmax
(PFlop/s)
Rpeak
(PFlop/s)
Power (kW)
1 Frontier - HPE Cray EX235a, AMD Optimized 3rd
Generation EPYC 64C 2GHz, AMD Instinct MI250X,
Slingshot-11, HPE
DOE/SC/Oak Ridge National Laboratory
United States
8,699,904 1,194.00 1,679.82 22,703
2 Supercomputer Fugaku - Supercomputer Fugaku,
A64FX 48C 2.2GHz, Tofu interconnect D, Fujitsu
RIKEN Center for Computational Science
Japan
7,630,848 442.01 537.21 29,899
3 LUMI - HPE Cray EX235a, AMD Optimized 3rd
Generation EPYC 64C 2GHz, AMD Instinct MI250X,
Slingshot-11, HPE
EuroHPC/CSC
Finland
2,220,288 309.10 428.70 6,016
4 Leonardo - BullSequana XH2000, Xeon Platinum 8358
32C 2.6GHz, NVIDIA A100 SXM4 64 GB, Quad-rail
NVIDIA HDR100 Infiniband, Atos
EuroHPC/CINECA
Italy
1,824,768 238.70 304.47 7,404
5 Summit - IBM Power System AC922, IBM POWER9 22C
3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR
Infiniband, IBM
DOE/SC/Oak Ridge National Laboratory
United States
2,414,592 148.60 200.79 10,096
CA: Performance and Power
43
Power Trends
n In CMOS IC technology
Frequency
Voltage
load
Capacitive
Power 2
´
´
=
×1000
×40 5V → 1V
n Intel 80386
consumed ~ 2 W
n 3.3 GHz Intel
Core i7
consumes 130 W
n Heat must be
dissipated from
1.5 x 1.5 cm chip
n This is the limit of
what can be
cooled by air
CA: Performance and Power
44
Energy and Power
n Dynamic energy
n Transistor switch from 0 -> 1 or 1 -> 0
n ½ x Capacitive load x Voltage2
n Dynamic power
n ½ x Capacitive load x Voltage2 x Frequency switched
n Reducing clock rate reduces power, not
energy
n Static power consumption
n Currentstatic x Voltage
n Scales with number of transistors
n To reduce: power gating
CA: Performance and Power
45
Switching Energy: Fundamental Physics
Every logic transition dissipates energy.
How can we
limit switching
energy?
(1) Reduce # of clock transitions. But we have work to do ...
(2) Reduce Vdd. But lowering Vdd limits the clock speed ...
(3) Fewer circuits. But more transistors can do more work.
(4) Reduce C per node. One reason why we scale processes.
V
dd
1
2
C V
dd
E0->1
=
2
V
dd
1
2
C V
dd
E1->0
=
2
C
Strong result: Independent of technology.
CA: Performance and Power
46
0V =
Second Factor: Leakage Currents
Even when a logic gate isn’t switching, it burns power.
Igate: Ideal switches have
zero DC current. But modern
transistor gates are a few atoms
thick, and are not ideal.
Isub: Even when this nFet
is off, it passes an Ioff
leakage current.
We can engineer any Ioff
we like, but a lower Ioff also
results in a lower Ion, and thus a
lower maximum clock speed.
Intel’s 2006 processor designs,
leakage vs switching power
A lot of work was
done to get a ratio
this good ... 50/50
is common.
CA: Performance and Power
47
Example of Quantifying Power
n Suppose 15% reduction in voltage
results in a 15% reduction in frequency.
What is impact on dynamic power?
dynamic
dynamic
dynamic
OldPower
OldPower
witched
FrequencyS
Voltage
Load
Capacitive
witched
FrequencyS
Voltage
Load
Capacitive
Power
´
´
´
´
´
´
´
´
´
Âť
=
´
=
=
6
.
0
)
85
(.
)
85
(.
85
.
2
/
1
2
/
1
3
2
2
CA: Performance and Power

More Related Content

Similar to Computer Architecture Performance and Energy

performance
performanceperformance
performancemanogallery
 
Lecture 3
Lecture 3Lecture 3
Lecture 3Mr SMAK
 
Parallel Computing - Lec 6
Parallel Computing - Lec 6Parallel Computing - Lec 6
Parallel Computing - Lec 6Shah Zaib
 
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- PerformanceLec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- PerformanceHsien-Hsin Sean Lee, Ph.D.
 
Performance analysis(Time & Space Complexity)
Performance analysis(Time & Space Complexity)Performance analysis(Time & Space Complexity)
Performance analysis(Time & Space Complexity)swapnac12
 
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisShah Zaib
 
02 intel v_tune_session_02
02 intel v_tune_session_0202 intel v_tune_session_02
02 intel v_tune_session_02Vivek chan
 
Top Metrics for Agile @Agile NCR2011
Top Metrics for Agile @Agile NCR2011Top Metrics for Agile @Agile NCR2011
Top Metrics for Agile @Agile NCR2011Priyank Pathak
 
Aca11 bk2 ch9
Aca11 bk2 ch9Aca11 bk2 ch9
Aca11 bk2 ch9Sumit Mittu
 
Fundamentals of the Analysis of Algorithm Efficiency
Fundamentals of the Analysis of Algorithm EfficiencyFundamentals of the Analysis of Algorithm Efficiency
Fundamentals of the Analysis of Algorithm EfficiencySaranya Natarajan
 
COCOMO MODEL
COCOMO MODELCOCOMO MODEL
COCOMO MODELmovie_2009
 
ppt-U3 - (Programming, Memory Interfacing).pptx
ppt-U3 - (Programming, Memory Interfacing).pptxppt-U3 - (Programming, Memory Interfacing).pptx
ppt-U3 - (Programming, Memory Interfacing).pptxRaviKiranVarma4
 
slides_metrics.pptx
slides_metrics.pptxslides_metrics.pptx
slides_metrics.pptxSanehaGill
 
LAS16-TR04: Using tracing to tune and optimize EAS (English)
LAS16-TR04: Using tracing to tune and optimize EAS (English)LAS16-TR04: Using tracing to tune and optimize EAS (English)
LAS16-TR04: Using tracing to tune and optimize EAS (English)Linaro
 
performance evaluation of parallel processors.pptx
performance evaluation of parallel processors.pptxperformance evaluation of parallel processors.pptx
performance evaluation of parallel processors.pptxnivedita murugan
 

Similar to Computer Architecture Performance and Energy (20)

performance
performanceperformance
performance
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
Parallel Computing - Lec 6
Parallel Computing - Lec 6Parallel Computing - Lec 6
Parallel Computing - Lec 6
 
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- PerformanceLec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
 
Ch1
Ch1Ch1
Ch1
 
Ch1
Ch1Ch1
Ch1
 
Performance analysis(Time & Space Complexity)
Performance analysis(Time & Space Complexity)Performance analysis(Time & Space Complexity)
Performance analysis(Time & Space Complexity)
 
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
 
02 intel v_tune_session_02
02 intel v_tune_session_0202 intel v_tune_session_02
02 intel v_tune_session_02
 
Top Metrics for Agile @Agile NCR2011
Top Metrics for Agile @Agile NCR2011Top Metrics for Agile @Agile NCR2011
Top Metrics for Agile @Agile NCR2011
 
1571 mean
1571 mean1571 mean
1571 mean
 
Aca11 bk2 ch9
Aca11 bk2 ch9Aca11 bk2 ch9
Aca11 bk2 ch9
 
Fundamentals of the Analysis of Algorithm Efficiency
Fundamentals of the Analysis of Algorithm EfficiencyFundamentals of the Analysis of Algorithm Efficiency
Fundamentals of the Analysis of Algorithm Efficiency
 
Mini Project- Dual Processor Computation
Mini Project- Dual Processor ComputationMini Project- Dual Processor Computation
Mini Project- Dual Processor Computation
 
COCOMO MODEL
COCOMO MODELCOCOMO MODEL
COCOMO MODEL
 
ppt-U3 - (Programming, Memory Interfacing).pptx
ppt-U3 - (Programming, Memory Interfacing).pptxppt-U3 - (Programming, Memory Interfacing).pptx
ppt-U3 - (Programming, Memory Interfacing).pptx
 
slides_metrics.pptx
slides_metrics.pptxslides_metrics.pptx
slides_metrics.pptx
 
LAS16-TR04: Using tracing to tune and optimize EAS (English)
LAS16-TR04: Using tracing to tune and optimize EAS (English)LAS16-TR04: Using tracing to tune and optimize EAS (English)
LAS16-TR04: Using tracing to tune and optimize EAS (English)
 
performance evaluation of parallel processors.pptx
performance evaluation of parallel processors.pptxperformance evaluation of parallel processors.pptx
performance evaluation of parallel processors.pptx
 
Unit 3 part2
Unit 3 part2Unit 3 part2
Unit 3 part2
 

More from Jason J Pulikkottil

Overview of computers and programming
Overview of computers and programmingOverview of computers and programming
Overview of computers and programmingJason J Pulikkottil
 
Repetition, Basic loop structures, Loop programming techniques
Repetition, Basic loop structures, Loop programming techniquesRepetition, Basic loop structures, Loop programming techniques
Repetition, Basic loop structures, Loop programming techniquesJason J Pulikkottil
 
ARM 7 and 9 Core Architecture Illustration
ARM 7 and 9 Core Architecture IllustrationARM 7 and 9 Core Architecture Illustration
ARM 7 and 9 Core Architecture IllustrationJason J Pulikkottil
 
Microprocessors and Microcontrollers 8086 Pin Connections
Microprocessors and Microcontrollers 8086 Pin ConnectionsMicroprocessors and Microcontrollers 8086 Pin Connections
Microprocessors and Microcontrollers 8086 Pin ConnectionsJason J Pulikkottil
 
Basic Electronics Theory and Components
Basic Electronics Theory and ComponentsBasic Electronics Theory and Components
Basic Electronics Theory and ComponentsJason J Pulikkottil
 
Deep Learning for Health Informatics
Deep Learning for Health InformaticsDeep Learning for Health Informatics
Deep Learning for Health InformaticsJason J Pulikkottil
 

More from Jason J Pulikkottil (12)

Overview of computers and programming
Overview of computers and programmingOverview of computers and programming
Overview of computers and programming
 
Repetition, Basic loop structures, Loop programming techniques
Repetition, Basic loop structures, Loop programming techniquesRepetition, Basic loop structures, Loop programming techniques
Repetition, Basic loop structures, Loop programming techniques
 
Selection
SelectionSelection
Selection
 
Input and Output
Input and OutputInput and Output
Input and Output
 
Basic Elements of C++
Basic Elements of C++Basic Elements of C++
Basic Elements of C++
 
ARM 7 and 9 Core Architecture Illustration
ARM 7 and 9 Core Architecture IllustrationARM 7 and 9 Core Architecture Illustration
ARM 7 and 9 Core Architecture Illustration
 
Microprocessors and Microcontrollers 8086 Pin Connections
Microprocessors and Microcontrollers 8086 Pin ConnectionsMicroprocessors and Microcontrollers 8086 Pin Connections
Microprocessors and Microcontrollers 8086 Pin Connections
 
Basic Electronics Theory and Components
Basic Electronics Theory and ComponentsBasic Electronics Theory and Components
Basic Electronics Theory and Components
 
Electronic Circuits
Electronic CircuitsElectronic Circuits
Electronic Circuits
 
Power Amplifier
Power AmplifierPower Amplifier
Power Amplifier
 
E12 Resistor Series
E12 Resistor SeriesE12 Resistor Series
E12 Resistor Series
 
Deep Learning for Health Informatics
Deep Learning for Health InformaticsDeep Learning for Health Informatics
Deep Learning for Health Informatics
 

Recently uploaded

(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoĂŁo Esperancinha
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 

Recently uploaded (20)

(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 

Computer Architecture Performance and Energy

  • 1. 1 Computer Architecture Performance and Energy Consumption Computer Architecture CA: Performance and Power
  • 2. 2 Defining Performance n Which airplane has the best performance? CA: Performance and Power
  • 3. 3 Response Time and Throughput n Response time n How long it takes to do a task n Throughput n Total work done per unit time n e.g., tasks/transactions/… per hour n How are response time and throughput affected by n Replacing the processor with a faster version? n Adding more processors? n We’ll focus on response time for now… CA: Performance and Power
  • 4. 4 Relative Performance n Define Performance = 1/Execution Time n “X is n time faster than Y” n = = X Y Y X time Execution time Execution e Performanc e Performanc n Example: time taken to run a program n 10s on A, 15s on B n Execution TimeB / Execution TimeA = 15s / 10s = 1.5 n So A is 1.5 times faster than B CA: Performance and Power
  • 5. 5 Measuring Execution Time n Elapsed time n Total response time, including all aspects n Processing, I/O, OS overhead, idle time n Determines system performance n CPU time n Time spent processing a given job n Discounts I/O time, other jobs’ shares n Comprises user CPU time and system CPU time n Different programs are affected differently by CPU and system performance CA: Performance and Power
  • 6. 6 Analyze the Right Measurement! CA: Performance and Power Guides CPU design Guides system design CPU Time: Measuring CPU time (Ubuntu): $ time <program name> real 0m0.095s user 0m0.013s sys 0m0.008s Real elapsed time (in minutes and seconds) Time the CPU spends running program under measurement Total response time = CPU time + time spent waiting (for disk, I/O, ..)
  • 7. 7 CPU Clocking n Operation of digital hardware governed by a constant-rate clock Clock (cycles) Data transfer and computation Update state Clock period n Clock period: duration of a clock cycle n e.g., 250ps = 0.25ns = 250×10–12 s n Clock frequency (rate): cycles per second n e.g., 4.0GHz = 4000MHz = 4.0×109 Hz CA: Performance and Power
  • 8. 8 CPU Time n Performance improved by n Reducing number of clock cycles n Increasing clock rate n Hardware designer must often trade off clock rate against cycle count Rate Clock Cycles Clock CPU Time Cycle Clock Cycles Clock CPU Time CPU = ´ = CA: Performance and Power
  • 9. 9 CPU Time Example n Computer A: 2GHz clock, 10s CPU time n Designing Computer B n Aim for 6s CPU time n Can do faster clock, but causes 1.2 × clock cycles n How fast must Computer B clock be? 4GHz 6s 10 24 6s 10 20 1.2 Rate Clock 10 20 2GHz 10s Rate Clock Time CPU Cycles Clock 6s Cycles Clock 1.2 Time CPU Cycles Clock Rate Clock 9 9 B 9 A A A A B B B = ´ = ´ ´ = ´ = ´ = ´ = ´ = = CA: Performance and Power
  • 10. 10 Instruction Count and CPI n Instruction Count (IC) for a program n Determined by program, ISA and compiler n Average cycles per instruction (CPI) n Depends on program, CPU hardware and compiler n If different instructions have different CPI n Average CPI affected by instruction mix Rate Clock CPI Count n Instructio Time Cycle Clock CPI Count n Instructio Time CPU n Instructio per Cycles Count n Instructio Cycles Clock ´ = ´ ´ = ´ = CA: Performance and Power
  • 11. 11 CPI Example n Computer A: Cycle Time = 250ps, CPI = 2.0 n Computer B: Cycle Time = 500ps, CPI = 1.2 n Same ISA n Which is faster, and by how much? 1.2 500ps I 600ps I A Time CPU B Time CPU 600ps I 500ps 1.2 I B Time Cycle B CPI Count n Instructio B Time CPU 500ps I 250ps 2.0 I A Time Cycle A CPI Count n Instructio A Time CPU = ´ ´ = ´ = ´ ´ = ´ ´ = ´ = ´ ´ = ´ ´ = A is faster… …by this much CA: Performance and Power
  • 12. 12 How Calculate the 3 Components? n Clock Cycle Time: in specification of computer (Clock Rate in advertisements) n Instruction Count: n Count instructions in loop of small program n Use simulator to count instructions n Hardware counter in spec. register (most CPUs) °CPI: • Calculate: Execution Time / Clock cycle time Instruction Count • Hardware counter in special register (most CPUs) CA: Performance and Power
  • 13. 13 CPI in More Detail n If different instruction classes take different numbers of cycles ĂĽ = ´ = n 1 i i i ) Count n Instructio (CPI Cycles Clock n Weighted average CPI ĂĽ = á ø Ăś ç è ĂŚ ´ = = n 1 i i i Count n Instructio Count n Instructio CPI Count n Instructio Cycles Clock CPI Relative frequency CA: Performance and Power
  • 14. 14 Calculating Average CPI n First read CPIi for each individual instruction (add, sub, and, etc.) n Next use when it’s given or calculate relative frequency fi of each individual instruction n Finally multiply these two for each instruction and add them up to get final CPI 𝐶𝑃𝐼 = % !"# $ 𝑓𝑖 ∗ 𝐶𝑃𝐼𝑖 𝑓𝑖 = 𝐼𝐶𝑖/𝐼𝐶 CA: Performance and Power
  • 15. 15 Example with Bonus Points to Earn! Op Freqi CPIi Prod (% Time) ALU 50% 1 .5 (33%) Load 20% 2 .4 (27%) Store 15% 2 .3 (20%) Branch 15% 2 .3 (20%) 1.5 • What if you can make branch instructions twice as fast (CPIbr= 1 cycle) but clock rate (CR) will decrease by 12%? Will it be a speedup or slowdown and how much? Instruction Mix (Where time spent) CA: Performance and Power New CPI = 1.35 Time before change = IC*1.5 / CR = IC*1.5/CR Time after change = IC*1.35 / (0.88*CR) = IC*1.534/CR Time before/Time after = IC*1.5*CR / IC*1.534*CR = 0.978 => 2.2% slowdown Speedup < 1 because the time after is greater than the time before the change
  • 16. 16 Another CPI Example n Alternative compiled code sequences using instructions in classes A, B, C Class A B C CPI for class 1 2 3 IC in sequence 1 2 1 2 IC in sequence 2 4 1 1 n Sequence 1: IC = 5 n Clock Cycles = 2×1 + 1×2 + 2×3 = 10 n Avg. CPI = 10/5 = 2.0 n Sequence 2: IC = 6 n Clock Cycles = 4×1 + 1×2 + 1×3 = 9 n Avg. CPI = 9/6 = 1.5 CA: Performance and Power
  • 17. 17 CPU Performance Law n Performance depends on n Algorithm: affects IC, possibly CPI n Programming language: affects IC, CPI n Compiler: affects IC, CPI n Instruction set architecture: affects IC, CPI, Tc n Processor microarchitecture (organization): affects CPI, Tc n Technology: affects Tc The BIG Picture Rate Clock CPI Count n Instructio Time Cycle Clock CPI Count n Instructio Time CPU n Instructio per Cycles Count n Instructio Cycles Clock ´ = ´ ´ = ´ = cycle Clock Seconds n Instructio cycles Clock Program ns Instructio Time CPU ´ ´ = CA: Performance and Power
  • 18. 18 What Programs Measure for Comparison? n Ideally run typical programs with typical input before purchase, or before even build machine n Called a “workload”; For example: n Engineer uses compiler, spreadsheet n Author uses word processor, drawing program, compression software n In some situations it’s hard to do n Don’t have access to machine to “benchmark” before purchase n Don’t know workload in future CA: Performance and Power
  • 19. 19 SPEC CPU Benchmarks n Standard Performance Evaluation Corporation (SPEC) www.spec.org n Elapsed time to execute a selection of programs n Negligible I/O, so focuses on CPU performance n SPEC95: 8 integer (gcc, compress, li, ijpeg, perl, ...) & 10 floating-point (FP) programs (hydro2d, mgrid, applu, turbo3d, ...) n SPEC2000: 11 integer (gcc, bzip2, …), 18 FP (mgrid, swim, ma3d, …) n Separate average for integer and FP n Benchmarks distributed in source code n Compiler, machine designers target benchmarks, so try to change every 3 years CA: Performance and Power
  • 20. 20 How Summarize Suite Performance (1/4) n Arithmetic average of execution time of all programs? n But they vary by 4X in speed, so some would be more important than others in arithmetic average n Could add a weights per program, but how pick weight? n Different companies want different weights for their products n SPECRatio: Normalize execution times to reference computer, yielding a ratio proportional to performance = time on reference computer time on computer being rated CA: Performance and Power
  • 21. 21 How Summarize Suite Performance (2/4) n If program SPECRatio on Computer A is 1.25 times bigger than Computer B, then B A A B B reference A reference B A e Performanc e Performanc ime ExecutionT ime ExecutionT ime ExecutionT ime ExecutionT ime ExecutionT ime ExecutionT SPECRatio SPECRatio = = = = 25 . 1 • Note that when comparing 2 computers as a ratio, execution times on the reference computer drop out, so choice of reference computer is irrelevant CA: Performance and Power
  • 22. 22 How Summarize Suite Performance (3/4) n Since ratios, proper mean is geometric mean (SPECRatio unitless, so arithmetic mean meaningless) n n i i SPECRatio ean GeometricM Õ = = 1 1. Geometric mean of the ratios is the same as the ratio of the geometric means 2. Ratio of geometric means = Geometric mean of performance ratios Þ choice of reference computer is irrelevant! • These two points make geometric mean of ratios attractive to summarize performance CA: Performance and Power
  • 23. 23 SPECspeed 2017 Integer Benchmarks on a 1.8 GHz Intel Xeon E5-2650L CA: Performance and Power
  • 24. 24 Analysis: CT involves breaking down complex ideas, arguments, or situations into smaller components to understand their structure and relationships. Evaluation: It entails assessing the validity, reliability, and credibility of information, sources, and arguments. This involves considering evidence, biases, and potential shortcomings. Interpretation: CT involves interpreting information and data to derive meaning, identify patterns, and draw conclusions. Inference: It includes making reasonable and well-supported conclusions based on available evidence and information. Problem Solving: CT aids in solving problems by approaching them systematically, considering different perspectives, and evaluating potential solutions. Contextualization: It requires considering the broader context and implications of information, ideas, and decisions. Communication: CT involves effectively conveying ideas, arguments, and analyses to others and engaging in constructive discussions. Skepticism: Critical thinkers approach information with a healthy dose of skepticism, questioning assumptions and seeking evidence. Open-Mindedness: It involves being open to different viewpoints, considering alternative explanations, and being willing to change one's views in light of new evidence. Reflection: It includes reflecting on one's own thinking processes, biases, & assumptions. Critical thinking (CT) involves several key elements (in the words of AI itself): What’s at stake for you here Is a single mean a good predictor of the performance of programs in benchmark suite? How much confidence you can have when using it to compare different processors or to predict future performance on similar apps? It's tough to make predictions, especially about the future. Yogi Berra
  • 25. 25 How Summarize Suite Performance (4/4) n Does a single mean well summarize performance of programs in benchmark suite? n Can decide if mean a good predictor by characterizing variability of distribution using standard deviation n Like geometric mean, geometric standard deviation is multiplicative rather than arithmetic n Can simply take the logarithm of SPECRatios, compute the standard mean and standard deviation, and then take the exponent to convert back: n The geometric standard deviation, denoted by σg, is calculated as follows: log σg=[1/n∑n i=1(logxi−logG)2]1/2. n where G=n√x1⋅x2⋅…⋅xn is the geometric mean of SPECRatios (x1 . xn). ( ) ( ) ( ) ( ) i n i i SPECRatio StDev tDev GeometricS SPECRatio n ean GeometricM ln exp ln 1 exp 1 = á ø Ăś ç è ĂŚ ´ = ĂĽ = CA: Performance and Power
  • 26. 26 Example Standard Deviation: (1/3) 0 2000 4000 6000 8000 10000 12000 14000 wupwise swim mgrid applu mesa galgel art equake facerec ammp lucas fma3d sixtrack apsi SPECfpRatio 1372 5362 2712 GM = 2712 GStDev = 1.98 • GM and multiplicative StDev of SPECfp2000 for Itanium 2 Outside 1 StDev Itanium 2 is 2712/100 times as fast as Sun Ultra 5 (GM), & range within 1 Std. Deviation is [13.72, 53.62] CA: Performance and Power
  • 27. 27 Example Standard Deviation : (2/3) • GM and multiplicative StDev of SPECfp2000 for AMD Athlon 0 2000 4000 6000 8000 10000 12000 14000 wupwise swim mgrid applu mesa galgel art equake facerec ammp lucas fma3d sixtrack apsi SPECfpRatio 1494 2911 2086 GM = 2086 GStDev = 1.40 Outside 1 StDev Athon is 2086/100 times as fast as Sun Ultra 5 (GM), & range within 1 Std. Deviation is [14.94, 29.11] CA: Performance and Power
  • 28. 28 Example Standard Deviation (3/3) - 0.50 1.00 1.50 2.00 2.50 3.00 3.50 4.00 4.50 5.00 wupwise swim mgrid applu mesa galgel art equake facerec ammp lucas fma3d sixtrack apsi Ratio Itanium 2 v. Athlon for SPECfp2000 0.75 2.27 1.30 GM = 1.30 GStDev = 1.74 • GM and StDev Itanium 2 v Athlon Outside 1 StDev Ratio execution times (At/It) = Ratio of SPECratios (It/At) Itanium 2 1.30X Athlon (GM), 1 St.Dev. Range [0.75,2.27] CA: Performance and Power
  • 29. 29 Comments on Itanium 2 and Athlon n Standard deviation of 1.98 for Itanium 2 is much higher-- vs. 1.40--so results will differ more widely from the mean, and therefore are likely less predictable for Itanium 2 n Falling within one standard deviation: n 10 of 14 benchmarks (71%) for Itanium 2 n 11 of 14 benchmarks (78%) for Athlon n Thus, the results are quite compatible with a lognormal distribution (expect 68%) n Itanium 2 vs. Athlon St.Dev is 1.74, which is high, so less confidence in claim that Itanium 1.30 times as fast as Athlon n Indeed, Athlon faster on 6 of 14 programs CA: Performance and Power
  • 30. 30 Amdahl’s Law ( ) enhanced enhanced enhanced Enh. w/ Enh. w/o overall Speedup Fraction Fraction 1 1 ExTime ExTime Speedup + - = = Best you could ever hope to do: ( ) enhanced maximum Fraction - 1 1 Speedup = ( ) Ăş Ăť Ăš ĂŞ ĂŤ ĂŠ + - ´ = enhanced enhanced enhanced Enh. w/o Enh. w/ Speedup Fraction Fraction 1 ExTime ExTime F CA: Performance and Power
  • 31. 31 Amdahl’s Law Example n New CPU 10X faster n I/O bound server, so 60% time waiting for I/O ( ) ( ) 56 . 1 64 . 0 1 10 0.4 0.4 1 1 Speedup Fraction Fraction 1 1 Speedup enhanced enhanced enhanced overall = = + - = + - = • Apparently, its human nature to be attracted by 10X faster, vs. keeping in perspective its just 1.6X faster CA: Performance and Power
  • 32. 32 Question n Speedup = 1 (1 - F) + F S Question: Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster? 10 (A) 20 (B) None of the above (C) 100 (D) CA: Performance and Power
  • 33. 33 Consequence of Amdahl’s Law n The amount of speedup that can be achieved through parallelism is limited by the non-parallel portion of your program! Speedup Number of Processors Parallel portion Serial portion Time Number of Processors 1 2 3 4 5 CA: Performance and Power
  • 34. 34 Parallel Speed-up Examples (1/2) n 10 “scalar” operations (non-parallelizable) n 100 parallelizable operations n Say, element-wise addition of two 10x10 matrices. n 110 operations n 100/110 = .909 Parallelizable, 10/110 = 0.091 Scalar Z1 + Z2 + … + Z10 X1,1 X1,10 X10,1 X10,10 Y1,1 Y1,10 Y10,1 Y10,10 + Non-parallel part Parallel part Partition 10 ways and perform on 10 parallel processing units . . . . . . CA: Performance and Power
  • 35. 35 Parallel Speed-up Examples (2/2) n Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors Speedup = 1/(.091 + .909/10) = 1/0.1819 = 5.5 n What if there are 100 processors ? Speedup = 1/(.091 + .909/100) = 1/0.10009 = 10.0 n What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors? Speedup = 1/(.001 + .999/10) = 1/0.1009 = 9.9 n What if there are 100 processors ? Speedup = 1/(.001 + .999/100) = 1/0.01099 = 91 Speedup w/ E = 1 / [ (1-F) + F/S ] CA: Performance and Power
  • 36. 36 Strong and Weak Scaling n To get good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem n Strong scaling: When speedup is achieved on a parallel processor without increasing the size of the problem n Weak scaling: When speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors (Gustafson's law) n Load balancing is another important factor: every processor doing same amount of work n Just 1 unit with twice the load of others cuts speedup almost in half (bottleneck!) CA: Performance and Power
  • 37. 37 Other Performance Metrics n MIPS – Million Instructions Per Second • MFLOPS - Million Floating-point Operations Per Second 6 10 ) ( _ ´ = s Time count n Instructio MIPS 6 10 ) ( / _ _ ´ = s Time program ops point Floating MFLOPS • PetaFLOPS - 1015 Floating-point Operations Per Second 15 10 ) ( / _ _ ´ = s Time program ops point Floating PFLOPS CA: Performance and Power
  • 38. 38 Pitfall: MIPS as a Performance Metric n MIPS: Millions of Instructions Per Second n Doesn’t account for n Differences in ISAs between computers n Differences in complexity between instructions 6 6 6 10 CPI rate Clock 10 rate Clock CPI count n Instructio count n Instructio 10 time Execution count n Instructio MIPS ´ = ´ ´ = ´ = n CPI varies between programs on a given CPU CA: Performance and Power
  • 39. 39 Uniprocessor Performance Constrained by power, instruction-level parallelism, memory latency CA: Performance and Power
  • 40. 40 Multiprocessors n Multicore microprocessors n More than one processor per chip n Requires explicitly parallel programming n Compare with instruction level parallelism n Hardware executes multiple instructions at once n Hidden from the programmer n Hard to do n Programming for performance n Load balancing n Optimizing communication and synchronization CA: Performance and Power
  • 41. 41 Supercomputing ▪ Today: clusters of multi-core CPUs + GPUs ▪ Frontier: The Exascale-class HPE Cray EX Supercomputer at Oak Ridge National Laboratory (the fastest supercomputer in the world in 2023) ▪ 9,472 AMD Epyc 7453s "Trento" 64 core 2 GHz CPUs (606,208 cores) and 37,888 Instinct MI250X GPUs (8,335,360 cores). ▪ the most efficient supercomputer: 62.68 gigaflops/watt. Space 680 m 2 (7,300 sq ft) Speed 1.194 exaFLOPS (Rmax) / 1.67982 exaFLOPS (Rpeak) Cost US$600 million (est. cost) Purpose Scientific research and development CA: Performance and Power
  • 42. 42 Top 5 Supercomputers (TOP500, June 2023) RankSystem Cores Rmax (PFlop/s) Rpeak (PFlop/s) Power (kW) 1 Frontier - HPE Cray EX235a, AMD Optimized 3rd Generation EPYC 64C 2GHz, AMD Instinct MI250X, Slingshot-11, HPE DOE/SC/Oak Ridge National Laboratory United States 8,699,904 1,194.00 1,679.82 22,703 2 Supercomputer Fugaku - Supercomputer Fugaku, A64FX 48C 2.2GHz, Tofu interconnect D, Fujitsu RIKEN Center for Computational Science Japan 7,630,848 442.01 537.21 29,899 3 LUMI - HPE Cray EX235a, AMD Optimized 3rd Generation EPYC 64C 2GHz, AMD Instinct MI250X, Slingshot-11, HPE EuroHPC/CSC Finland 2,220,288 309.10 428.70 6,016 4 Leonardo - BullSequana XH2000, Xeon Platinum 8358 32C 2.6GHz, NVIDIA A100 SXM4 64 GB, Quad-rail NVIDIA HDR100 Infiniband, Atos EuroHPC/CINECA Italy 1,824,768 238.70 304.47 7,404 5 Summit - IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM DOE/SC/Oak Ridge National Laboratory United States 2,414,592 148.60 200.79 10,096 CA: Performance and Power
  • 43. 43 Power Trends n In CMOS IC technology Frequency Voltage load Capacitive Power 2 ´ ´ = ×1000 ×40 5V → 1V n Intel 80386 consumed ~ 2 W n 3.3 GHz Intel Core i7 consumes 130 W n Heat must be dissipated from 1.5 x 1.5 cm chip n This is the limit of what can be cooled by air CA: Performance and Power
  • 44. 44 Energy and Power n Dynamic energy n Transistor switch from 0 -> 1 or 1 -> 0 n ½ x Capacitive load x Voltage2 n Dynamic power n ½ x Capacitive load x Voltage2 x Frequency switched n Reducing clock rate reduces power, not energy n Static power consumption n Currentstatic x Voltage n Scales with number of transistors n To reduce: power gating CA: Performance and Power
  • 45. 45 Switching Energy: Fundamental Physics Every logic transition dissipates energy. How can we limit switching energy? (1) Reduce # of clock transitions. But we have work to do ... (2) Reduce Vdd. But lowering Vdd limits the clock speed ... (3) Fewer circuits. But more transistors can do more work. (4) Reduce C per node. One reason why we scale processes. V dd 1 2 C V dd E0->1 = 2 V dd 1 2 C V dd E1->0 = 2 C Strong result: Independent of technology. CA: Performance and Power
  • 46. 46 0V = Second Factor: Leakage Currents Even when a logic gate isn’t switching, it burns power. Igate: Ideal switches have zero DC current. But modern transistor gates are a few atoms thick, and are not ideal. Isub: Even when this nFet is off, it passes an Ioff leakage current. We can engineer any Ioff we like, but a lower Ioff also results in a lower Ion, and thus a lower maximum clock speed. Intel’s 2006 processor designs, leakage vs switching power A lot of work was done to get a ratio this good ... 50/50 is common. CA: Performance and Power
  • 47. 47 Example of Quantifying Power n Suppose 15% reduction in voltage results in a 15% reduction in frequency. What is impact on dynamic power? dynamic dynamic dynamic OldPower OldPower witched FrequencyS Voltage Load Capacitive witched FrequencyS Voltage Load Capacitive Power ´ ´ ´ ´ ´ ´ ´ ´ ´ Âť = ´ = = 6 . 0 ) 85 (. ) 85 (. 85 . 2 / 1 2 / 1 3 2 2 CA: Performance and Power