1. HotChips 2002 - Intel® Itanium® 2 Processor
A Brief Analysis of the SPEC
CPU2000 Benchmarks on the
Intel® Itanium® 2 Processor
James McCormick, HP
Allan Knies, Intel
All trademarks are the property of their respective owners
2. HotChips 2002 - Intel® Itanium® 2 Processor
Agenda
This is a data-oriented presentation, not research
Agenda
– Brief performance summary
– Comparison of how HP and Intel compilers use
Itanium® architecture features and compare to best
RISC.
– Analysis of microarchitectural features of the Intel®
Itanium® 2 processor and how it affects SPEC
CPU2000 performance
3. HotChips 2002 - Intel® Itanium® 2 Processor
Overall Performance
SPEC{int/fp}_base2000 810/1356 (best 0.18u)
Linpack 1000: 3.5 Gflops (best overall)
TPC-C (SQL/4P): 78K tpmC (best 4P number)
SPECweb_SSL: 1520 connections (best of class)
¾ Itanium® 2 processor best of class on a wide range of applications
SPEC CPU2000 Results
0
200
400
600
800
1000
1200
1400
1600
SPECint_base2000 SPECint_peak2000 SPECfp_base2000 SPECfp_peak2000
results from www.spec.org
Itanium2/HP 1Ghz IBM Power4 1.3 Ghz Alpha 21264 1Ghz UltraSPARC IIICu 1.05 Ghz
4. HotChips 2002 - Intel® Itanium® 2 Processor
Part I: ISA and Compiler
Comparisons
Allan Knies
Itanium® Architecture and Performance
Team
Intel Corporation
allan.knies@intel.com
5. HotChips 2002 - Intel® Itanium® 2 Processor
• Number of ‘useful instructions’ (shown in blue) +/- 1% of Alpha
• Total instructions (blue+green) +20-30% of Alpha (due to NOPs)
• Bundling is main cause of extra NOPS
¾ We’ll see that extra instr/nops are not substantially impacting perf
Instruction Mix: High Level
0.0E+00
5.0E+10
1.0E+11
1.5E+11
2.0E+11
2.5E+11
3.0E+11
3.5E+11
4.0E+11
Itanium2/Intel Itanium2/HP Alpha
TotalInstructions
Branch or Not Predicated Predicate True Predicate False Nop
6. HotChips 2002 - Intel® Itanium® 2 Processor
Instruction Mix Details Introduction
Itanium/Alpha ISAs:
• Itanium® arch has 40% fewer memory operations and 30% fewer
branches than Alpha (some impact from no-pgo on Alpha)
• Itanium arch has about 10% more ALU/compares/shifts than Alpha
• Itanium arch has about 20-30% NOPs – eventually expect this to be
under 20%
HP/Intel Compiler:
• HP compiler uses more memory and ALU ops than Intel
• Implies HP more conservative with registers – we’ll see impact later
¾ Itanium® architecture trades more ‘easy’ instructions (NOP, alu,
cmp) for reducing the ‘hard’ instructions (branch, load)
¾ More than one way to get good performance from a compiler
7. HotChips 2002 - Intel® Itanium® 2 Processor
• Alpha has 1.4x mem refs of Itanium® architecture (incl/RSE ops)
• RSE ops are easy to optimize in the future, if needed
• HP compiler is more consv with regs, but has more ALU/reloads
¾ Large register file, good compiler technology, and RSE pay off
¾ HP has lower RSE costs, but more reloads/alu ops
Instruction Mix: Memory Instruction Details
0.0E+00
1.0E+10
2.0E+10
3.0E+10
4.0E+10
5.0E+10
6.0E+10
7.0E+10
8.0E+10
9.0E+10
1.0E+11
Itanium2/Intel Itanium2/HP Alpha
MemoryInstructions
RSE
Stores
Loads
8. HotChips 2002 - Intel® Itanium® 2 Processor
• Total time spent in RSE is only 3%-4% of overall execution
• 1½ -3 cycles per call/return for RSE spill/fill activity
• 1-3 instructions per subroutine setup for RSE
• Intel compiler has fewer calls, but more cycles/call
¾ Register stack provides very low overhead call/return support
Features: Register Stack
0.0E +00
2.0E +09
4.0E +09
6.0E +09
8.0E +09
1.0E +10
Intel HP
RS E
TotalRSECycles
Total RS E Cycles
9. HotChips 2002 - Intel® Itanium® 2 Processor
• HP compiler generates 13% fewer br’s than Intel, 9% more mispredictions
• HP compiler generates 31% fewer branches than Alpha
• Itanium-based binaries fewer tk br’s than Alpha, data skewed by lack of PGO
¾ Itanium® architecture reduces the # branches and branch mispredicts
¾ HP/Intel compilers both reduce # branches – but with different focus
Instruction Mix: Branch Detail
1.389E+10 1.282E+10
2.42E+10
1.895E+10
1.624E+10
1.383E+10
0.0E+00
5.0E+09
1.0E+10
1.5E+10
2.0E+10
2.5E+10
3.0E+10
3.5E+10
4.0E+10
Itanium2/Intel Itanium2/HP Alpha
BranchInstructions
Not Taken Branches
Taken Branches
10. HotChips 2002 - Intel® Itanium® 2 Processor
• Useful instructions per taken branch is very high
• Useful instructions per mispredicted branch slightly better for Intel
• Useful instructions/call very high – Intel/HP compilers very aggr inlining
¾ Advanced compilers reduce stress on br prediction/Icache HW
¾ Trading Istream size for regularity improves HW efficiency
Parallelism: Region Size
19
159
212
20
141
158
0
50
100
150
200
250
U s eful Ins tr/Tk Br U s eful Ins t/MP Branch U s eful Ins t/C all
UsefulInstructions
Intel
HP
11. HotChips 2002 - Intel® Itanium® 2 Processor
• About 20-30% of loads are speculative in Intel binaries
• Data shows tiny penalty for chk.s usage despite high usage rate
• Intel has 10x more chk.s than HP, HP uses ‘no recovery model’
selectively (per benchmark decision)
• HP has 10x more chk.a/ld.c then Intel, recovery less than 1% time
¾ Speculation heavily used, but causes little overhead
Features: Speculation
1.0E+00
1.0E+02
1.0E+04
1.0E+06
1.0E+08
1.0E+10
Intel HP Intel HP
chk.s chk.a and ld.c
Instructions # Checks # Failed
12. HotChips 2002 - Intel® Itanium® 2 Processor
• Useful IPC computed using ‘unstalled IPC’
• Compilers find 2.5-3.0 IPC in integer apps (even beyond SPEC)
• Dynamic delays reduce this to 1.3 achieved for CPU2000 integer
¾ Differences in perf/heuristics shows headroom for both compilers
¾ Good IPC found by both compilers, room for uArch improvements
Useful IPC
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4.00
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
gzip vpr gcc mcf crafty parser eon perl gap vortex bzip2 twolf A vg
UsefulIPC
A chieved Found
13. HotChips 2002 - Intel® Itanium® 2 Processor
Notes
• Itanium-based binaries used for these stats are older than those
used for the official SPEC submission (less than 10% difference)
• The results for Intel® Itanium® 2 processor in Part I are: one with
the Intel compiler running on 64-bit MS OS and another with the
HP compiler running under HP-UX
• Alpha ISA numbers via simulation, binaries used near peak (no
profile guided optimization), tuned for 21264. Alpha data missing
VPR and PERL – thus, left out of all averages in Part I.
• Results computed with arithmetic averages – data thus skewed
towards long-running benchmarks
14. HotChips 2002 - Intel® Itanium® 2 Processor
Part II: Microarchitecture
James McCormick
HP Colorado VLSI Laboratory
15. HotChips 2002 - Intel® Itanium® 2 Processor
Cycle Accounting: INT
7%
38%
3%
1%
49%
2%
PIPE FLUSH DATA ACCESS
INSTR ACCESS SCOREBOARD
RSE UNSTALLED
Cycle Accounting: FP
1%
47%
1%
0%
50%
1%
PIPE FLUSH DATA ACCESS
INSTR ACCESS SCOREBOARD
RSE UNSTALLED
¾ Remaining: unstalled execution and data access
¾ Great performance
16. HotChips 2002 - Intel® Itanium® 2 Processor
Instructions Per Unstalled Cycle
0
1
2
3
4
5
6
Integer Floating Point
IPC
USEFUL w/ PRED OFF + NOPS
¾ High machine parallelism
17. HotChips 2002 - Intel® Itanium® 2 Processor
Total Instruction Fetch Latency
0
20
40
60
80
100
164.gzip
175.vpr
176.gcc
181.mcf
186.crafty
197.parser
252.eon
253.perlbmk
254.gap
255.vortex
256.bzip2
300.twolf
AVERAGE
168.wupwise
171.swim
172.mgrid
173.applu
177.mesa
178.galgel
179.art
183.equake
187.facerec
188.ammp
189.lucas
191.fma3d
200.sixtrack
301.apsi
AVERAGE
Integer Floating Point
Percent
L1I HIT L2 INSTR HIT L3 INSTR HIT MEM INSTR ACCESS
¾ Noticeable L1I misses
¾ Very small I-fetch component
18. HotChips 2002 - Intel® Itanium® 2 Processor
¾ High FE throughput rate
¾ I-miss latency hidden
Instruction Fetch Ahead
0
1
2
3
4
5
6
INT AVERAGE FP AVERAGE
InstructionsPerCycle
0
10
20
30
40
50
60
70
80
90
100
PercentExposed
DECOUPLED DISP THROUGHPUT DECOUPLED FE THROUGHPUT
FE BUBBLES EXPOSED
19. HotChips 2002 - Intel® Itanium® 2 Processor
Branch Mispredictions
37.7
0
2
4
6
8
10
12
14
16
18
20
164.gzip
175.vpr
176.gcc
181.mcf
186.crafty
197.parser
252.eon
253.perlbmk
254.gap
255.vortex
256.bzip2
300.twolf
AVERAGE
Integer
Misprediction
Percentage
0
2
4
6
8
10
12
14
16
18
20
FullPipelinesPer
Flush
WRONG PAT H WRONG T ARGET UNKNOWN FULL PIPELINES PER M ISPRED BR
¾ High accuracy, low penalty
¾ Helps instruction fetch
20. HotChips 2002 - Intel® Itanium® 2 Processor
Total Data Read Latency
0
10
20
30
40
50
60
70
80
90
100
164.gzip
175.vpr
176.gcc
181.mcf
186.crafty
197.parser
252.eon
253.perlbmk
254.gap
255.vortex
256.bzip2
300.twolf
AVERAGE
168.wupwise
171.swim
172.mgrid
173.applu
177.mesa
178.galgel
179.art
183.equake
187.facerec
188.ammp
189.lucas
191.fma3d
200.sixtrack
301.apsi
AVERAGE
Integer Floating Point
PercentofLatency
L1D DATA READ L2 DATA READ L3 DATA READ MEM DATA READ
¾ Large component, large opportunity
21. HotChips 2002 - Intel® Itanium® 2 Processor
Summary
Itanium® 2 Processor Delivers Leadership
Performance
– Architecture / Compilers
• Expose lots of ILP to in-order pipeline
• Replace difficult instructions with easy ones
• RSE and large register file work well together
– CPU Design
• Machine parallelism handles high ILP
• Aggressive design reduces bottlenecks
22. HotChips 2002 - Intel® Itanium® 2 Processor
Acknowledgements
Jason Cantin (U. Wisconsin):
Ran all the experiments to gather the Alpha ISA data with
changes/alterations at our request. Without Jason, we would have no
Alpha data of any kind. Thanks to his efforts to rerun all the data for
us!
Hwansoo Han, Youngsoo Choi, Geetha Vederaman (Intel):
Intel data gathering, formatting, methodology, performance monitoring
Bryan Black, Ed Grochowski, Jim Callister, Carole Dulong (Intel):
Extensive draft review and comments
Caliper Team (HP):
Tool support
24. HotChips 2002 - Intel® Itanium® 2 Processor
Configuration Information
Intel® Itanium® 2 processor data for Intel systems/compilers:
– Binaries: Pre-production version of Intel C++ 6.0 Compiler, -O3 with interprocedural
optimization and profile guided optimization
– Run on: Pre-production stepping of Itanium 2 processor 800Mhz/200Mhz core/bus, Intel 870
chipset, monitor kernel and user instructions and events
Intel® Itanium® 2 processor data for HP systems/compilers:
– Binaries: Pre-production version of HP compiler
– Run on: Pre-production stepping of Itanium 2 processor 1000Mhz/200Mhz core/bus, rx2600
prototype, monitor kernel and user instructions and events
Alpha ISA data:
– Run on: Functional simulator, system code not simulated
– Simulation details: http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/
– All data thanks to Jason Cantin at the University of Wisconsin.
– Binaries: http://www.eecs.umich.edu/~chriswea/benchmarks/Com.cf
In general, these binaries are optimized for 21264, peak optimization. Usually, –g3 –fast –O4,
but NO profile feedback. Compiler: DECCV5.9-005 and DIGIT AL C++ V6.1-027
Other remarks:
– All averages in slides left out PERL and VPR due to data not available for Alpha