SlideShare a Scribd company logo
1 of 24
Download to read offline
HotChips 2002 - Intel® Itanium® 2 Processor
A Brief Analysis of the SPEC
CPU2000 Benchmarks on the
Intel® Itanium® 2 Processor
James McCormick, HP
Allan Knies, Intel
All trademarks are the property of their respective owners
HotChips 2002 - Intel® Itanium® 2 Processor
Agenda
This is a data-oriented presentation, not research
Agenda
– Brief performance summary
– Comparison of how HP and Intel compilers use
Itanium® architecture features and compare to best
RISC.
– Analysis of microarchitectural features of the Intel®
Itanium® 2 processor and how it affects SPEC
CPU2000 performance
HotChips 2002 - Intel® Itanium® 2 Processor
Overall Performance
SPEC{int/fp}_base2000 810/1356 (best 0.18u)
Linpack 1000: 3.5 Gflops (best overall)
TPC-C (SQL/4P): 78K tpmC (best 4P number)
SPECweb_SSL: 1520 connections (best of class)
¾ Itanium® 2 processor best of class on a wide range of applications
SPEC CPU2000 Results
0
200
400
600
800
1000
1200
1400
1600
SPECint_base2000 SPECint_peak2000 SPECfp_base2000 SPECfp_peak2000
results from www.spec.org
Itanium2/HP 1Ghz IBM Power4 1.3 Ghz Alpha 21264 1Ghz UltraSPARC IIICu 1.05 Ghz
HotChips 2002 - Intel® Itanium® 2 Processor
Part I: ISA and Compiler
Comparisons
Allan Knies
Itanium® Architecture and Performance
Team
Intel Corporation
allan.knies@intel.com
HotChips 2002 - Intel® Itanium® 2 Processor
• Number of ‘useful instructions’ (shown in blue) +/- 1% of Alpha
• Total instructions (blue+green) +20-30% of Alpha (due to NOPs)
• Bundling is main cause of extra NOPS
¾ We’ll see that extra instr/nops are not substantially impacting perf
Instruction Mix: High Level
0.0E+00
5.0E+10
1.0E+11
1.5E+11
2.0E+11
2.5E+11
3.0E+11
3.5E+11
4.0E+11
Itanium2/Intel Itanium2/HP Alpha
TotalInstructions
Branch or Not Predicated Predicate True Predicate False Nop
HotChips 2002 - Intel® Itanium® 2 Processor
Instruction Mix Details Introduction
Itanium/Alpha ISAs:
• Itanium® arch has 40% fewer memory operations and 30% fewer
branches than Alpha (some impact from no-pgo on Alpha)
• Itanium arch has about 10% more ALU/compares/shifts than Alpha
• Itanium arch has about 20-30% NOPs – eventually expect this to be
under 20%
HP/Intel Compiler:
• HP compiler uses more memory and ALU ops than Intel
• Implies HP more conservative with registers – we’ll see impact later
¾ Itanium® architecture trades more ‘easy’ instructions (NOP, alu,
cmp) for reducing the ‘hard’ instructions (branch, load)
¾ More than one way to get good performance from a compiler
HotChips 2002 - Intel® Itanium® 2 Processor
• Alpha has 1.4x mem refs of Itanium® architecture (incl/RSE ops)
• RSE ops are easy to optimize in the future, if needed
• HP compiler is more consv with regs, but has more ALU/reloads
¾ Large register file, good compiler technology, and RSE pay off
¾ HP has lower RSE costs, but more reloads/alu ops
Instruction Mix: Memory Instruction Details
0.0E+00
1.0E+10
2.0E+10
3.0E+10
4.0E+10
5.0E+10
6.0E+10
7.0E+10
8.0E+10
9.0E+10
1.0E+11
Itanium2/Intel Itanium2/HP Alpha
MemoryInstructions
RSE
Stores
Loads
HotChips 2002 - Intel® Itanium® 2 Processor
• Total time spent in RSE is only 3%-4% of overall execution
• 1½ -3 cycles per call/return for RSE spill/fill activity
• 1-3 instructions per subroutine setup for RSE
• Intel compiler has fewer calls, but more cycles/call
¾ Register stack provides very low overhead call/return support
Features: Register Stack
0.0E +00
2.0E +09
4.0E +09
6.0E +09
8.0E +09
1.0E +10
Intel HP
RS E
TotalRSECycles
Total RS E Cycles
HotChips 2002 - Intel® Itanium® 2 Processor
• HP compiler generates 13% fewer br’s than Intel, 9% more mispredictions
• HP compiler generates 31% fewer branches than Alpha
• Itanium-based binaries fewer tk br’s than Alpha, data skewed by lack of PGO
¾ Itanium® architecture reduces the # branches and branch mispredicts
¾ HP/Intel compilers both reduce # branches – but with different focus
Instruction Mix: Branch Detail
1.389E+10 1.282E+10
2.42E+10
1.895E+10
1.624E+10
1.383E+10
0.0E+00
5.0E+09
1.0E+10
1.5E+10
2.0E+10
2.5E+10
3.0E+10
3.5E+10
4.0E+10
Itanium2/Intel Itanium2/HP Alpha
BranchInstructions
Not Taken Branches
Taken Branches
HotChips 2002 - Intel® Itanium® 2 Processor
• Useful instructions per taken branch is very high
• Useful instructions per mispredicted branch slightly better for Intel
• Useful instructions/call very high – Intel/HP compilers very aggr inlining
¾ Advanced compilers reduce stress on br prediction/Icache HW
¾ Trading Istream size for regularity improves HW efficiency
Parallelism: Region Size
19
159
212
20
141
158
0
50
100
150
200
250
U s eful Ins tr/Tk Br U s eful Ins t/MP Branch U s eful Ins t/C all
UsefulInstructions
Intel
HP
HotChips 2002 - Intel® Itanium® 2 Processor
• About 20-30% of loads are speculative in Intel binaries
• Data shows tiny penalty for chk.s usage despite high usage rate
• Intel has 10x more chk.s than HP, HP uses ‘no recovery model’
selectively (per benchmark decision)
• HP has 10x more chk.a/ld.c then Intel, recovery less than 1% time
¾ Speculation heavily used, but causes little overhead
Features: Speculation
1.0E+00
1.0E+02
1.0E+04
1.0E+06
1.0E+08
1.0E+10
Intel HP Intel HP
chk.s chk.a and ld.c
Instructions # Checks # Failed
HotChips 2002 - Intel® Itanium® 2 Processor
• Useful IPC computed using ‘unstalled IPC’
• Compilers find 2.5-3.0 IPC in integer apps (even beyond SPEC)
• Dynamic delays reduce this to 1.3 achieved for CPU2000 integer
¾ Differences in perf/heuristics shows headroom for both compilers
¾ Good IPC found by both compilers, room for uArch improvements
Useful IPC
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4.00
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
Intel
HP
gzip vpr gcc mcf crafty parser eon perl gap vortex bzip2 twolf A vg
UsefulIPC
A chieved Found
HotChips 2002 - Intel® Itanium® 2 Processor
Notes
• Itanium-based binaries used for these stats are older than those
used for the official SPEC submission (less than 10% difference)
• The results for Intel® Itanium® 2 processor in Part I are: one with
the Intel compiler running on 64-bit MS OS and another with the
HP compiler running under HP-UX
• Alpha ISA numbers via simulation, binaries used near peak (no
profile guided optimization), tuned for 21264. Alpha data missing
VPR and PERL – thus, left out of all averages in Part I.
• Results computed with arithmetic averages – data thus skewed
towards long-running benchmarks
HotChips 2002 - Intel® Itanium® 2 Processor
Part II: Microarchitecture
James McCormick
HP Colorado VLSI Laboratory
HotChips 2002 - Intel® Itanium® 2 Processor
Cycle Accounting: INT
7%
38%
3%
1%
49%
2%
PIPE FLUSH DATA ACCESS
INSTR ACCESS SCOREBOARD
RSE UNSTALLED
Cycle Accounting: FP
1%
47%
1%
0%
50%
1%
PIPE FLUSH DATA ACCESS
INSTR ACCESS SCOREBOARD
RSE UNSTALLED
¾ Remaining: unstalled execution and data access
¾ Great performance
HotChips 2002 - Intel® Itanium® 2 Processor
Instructions Per Unstalled Cycle
0
1
2
3
4
5
6
Integer Floating Point
IPC
USEFUL w/ PRED OFF + NOPS
¾ High machine parallelism
HotChips 2002 - Intel® Itanium® 2 Processor
Total Instruction Fetch Latency
0
20
40
60
80
100
164.gzip
175.vpr
176.gcc
181.mcf
186.crafty
197.parser
252.eon
253.perlbmk
254.gap
255.vortex
256.bzip2
300.twolf
AVERAGE
168.wupwise
171.swim
172.mgrid
173.applu
177.mesa
178.galgel
179.art
183.equake
187.facerec
188.ammp
189.lucas
191.fma3d
200.sixtrack
301.apsi
AVERAGE
Integer Floating Point
Percent
L1I HIT L2 INSTR HIT L3 INSTR HIT MEM INSTR ACCESS
¾ Noticeable L1I misses
¾ Very small I-fetch component
HotChips 2002 - Intel® Itanium® 2 Processor
¾ High FE throughput rate
¾ I-miss latency hidden
Instruction Fetch Ahead
0
1
2
3
4
5
6
INT AVERAGE FP AVERAGE
InstructionsPerCycle
0
10
20
30
40
50
60
70
80
90
100
PercentExposed
DECOUPLED DISP THROUGHPUT DECOUPLED FE THROUGHPUT
FE BUBBLES EXPOSED
HotChips 2002 - Intel® Itanium® 2 Processor
Branch Mispredictions
37.7
0
2
4
6
8
10
12
14
16
18
20
164.gzip
175.vpr
176.gcc
181.mcf
186.crafty
197.parser
252.eon
253.perlbmk
254.gap
255.vortex
256.bzip2
300.twolf
AVERAGE
Integer
Misprediction
Percentage
0
2
4
6
8
10
12
14
16
18
20
FullPipelinesPer
Flush
WRONG PAT H WRONG T ARGET UNKNOWN FULL PIPELINES PER M ISPRED BR
¾ High accuracy, low penalty
¾ Helps instruction fetch
HotChips 2002 - Intel® Itanium® 2 Processor
Total Data Read Latency
0
10
20
30
40
50
60
70
80
90
100
164.gzip
175.vpr
176.gcc
181.mcf
186.crafty
197.parser
252.eon
253.perlbmk
254.gap
255.vortex
256.bzip2
300.twolf
AVERAGE
168.wupwise
171.swim
172.mgrid
173.applu
177.mesa
178.galgel
179.art
183.equake
187.facerec
188.ammp
189.lucas
191.fma3d
200.sixtrack
301.apsi
AVERAGE
Integer Floating Point
PercentofLatency
L1D DATA READ L2 DATA READ L3 DATA READ MEM DATA READ
¾ Large component, large opportunity
HotChips 2002 - Intel® Itanium® 2 Processor
Summary
Itanium® 2 Processor Delivers Leadership
Performance
– Architecture / Compilers
• Expose lots of ILP to in-order pipeline
• Replace difficult instructions with easy ones
• RSE and large register file work well together
– CPU Design
• Machine parallelism handles high ILP
• Aggressive design reduces bottlenecks
HotChips 2002 - Intel® Itanium® 2 Processor
Acknowledgements
Jason Cantin (U. Wisconsin):
Ran all the experiments to gather the Alpha ISA data with
changes/alterations at our request. Without Jason, we would have no
Alpha data of any kind. Thanks to his efforts to rerun all the data for
us!
Hwansoo Han, Youngsoo Choi, Geetha Vederaman (Intel):
Intel data gathering, formatting, methodology, performance monitoring
Bryan Black, Ed Grochowski, Jim Callister, Carole Dulong (Intel):
Extensive draft review and comments
Caliper Team (HP):
Tool support
HotChips 2002 - Intel® Itanium® 2 Processor
Backup/Reference Slides
HotChips 2002 - Intel® Itanium® 2 Processor
Configuration Information
Intel® Itanium® 2 processor data for Intel systems/compilers:
– Binaries: Pre-production version of Intel C++ 6.0 Compiler, -O3 with interprocedural
optimization and profile guided optimization
– Run on: Pre-production stepping of Itanium 2 processor 800Mhz/200Mhz core/bus, Intel 870
chipset, monitor kernel and user instructions and events
Intel® Itanium® 2 processor data for HP systems/compilers:
– Binaries: Pre-production version of HP compiler
– Run on: Pre-production stepping of Itanium 2 processor 1000Mhz/200Mhz core/bus, rx2600
prototype, monitor kernel and user instructions and events
Alpha ISA data:
– Run on: Functional simulator, system code not simulated
– Simulation details: http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/
– All data thanks to Jason Cantin at the University of Wisconsin.
– Binaries: http://www.eecs.umich.edu/~chriswea/benchmarks/Com.cf
In general, these binaries are optimized for 21264, peak optimization. Usually, –g3 –fast –O4,
but NO profile feedback. Compiler: DECCV5.9-005 and DIGIT AL C++ V6.1-027
Other remarks:
– All averages in slides left out PERL and VPR due to data not available for Alpha

More Related Content

What's hot

Pc based wire less data aquisition system using rf(1)
Pc based wire less data aquisition system using rf(1)Pc based wire less data aquisition system using rf(1)
Pc based wire less data aquisition system using rf(1)Vishalya Dulam
 
Microcontroller from basic_to_advanced
Microcontroller from basic_to_advancedMicrocontroller from basic_to_advanced
Microcontroller from basic_to_advancedImran Sheikh
 
Ei502microprocessorsmicrtocontrollerspart4 8051 Microcontroller
Ei502microprocessorsmicrtocontrollerspart4 8051 MicrocontrollerEi502microprocessorsmicrtocontrollerspart4 8051 Microcontroller
Ei502microprocessorsmicrtocontrollerspart4 8051 MicrocontrollerDebasis Das
 
PIC MICROCONTROLLERS -CLASS NOTES
PIC MICROCONTROLLERS -CLASS NOTESPIC MICROCONTROLLERS -CLASS NOTES
PIC MICROCONTROLLERS -CLASS NOTESDr.YNM
 
Microcontroller (8051) by K. Vijay Kumar
Microcontroller (8051) by K. Vijay KumarMicrocontroller (8051) by K. Vijay Kumar
Microcontroller (8051) by K. Vijay KumarVijay Kumar
 
At 89s51
At 89s51At 89s51
At 89s51Mr Giap
 
Ashish microcontroller 8051
Ashish microcontroller 8051Ashish microcontroller 8051
Ashish microcontroller 8051ASHISH RAJ
 
Intel Knights Landing Slides
Intel Knights Landing SlidesIntel Knights Landing Slides
Intel Knights Landing SlidesRonen Mendezitsky
 
INTRODUCTION TO MICROCONTROLLER
INTRODUCTION TO MICROCONTROLLERINTRODUCTION TO MICROCONTROLLER
INTRODUCTION TO MICROCONTROLLERAnkita Jaiswal
 
Presentation on Embedded system using micro controller by PARAS JHA
Presentation on Embedded system using micro controller  by PARAS JHAPresentation on Embedded system using micro controller  by PARAS JHA
Presentation on Embedded system using micro controller by PARAS JHAParas Jha
 
MICROCONTROLLER 8051- Architecture & Pin Configuration
MICROCONTROLLER 8051- Architecture & Pin Configuration MICROCONTROLLER 8051- Architecture & Pin Configuration
MICROCONTROLLER 8051- Architecture & Pin Configuration AKHIL MADANKAR
 
8051 microcontroller
8051 microcontroller 8051 microcontroller
8051 microcontroller Gaurav Verma
 
Microcontroller
MicrocontrollerMicrocontroller
MicrocontrollerSpitiq
 

What's hot (20)

Pc based wire less data aquisition system using rf(1)
Pc based wire less data aquisition system using rf(1)Pc based wire less data aquisition system using rf(1)
Pc based wire less data aquisition system using rf(1)
 
Microcontroller from basic_to_advanced
Microcontroller from basic_to_advancedMicrocontroller from basic_to_advanced
Microcontroller from basic_to_advanced
 
Ei502microprocessorsmicrtocontrollerspart4 8051 Microcontroller
Ei502microprocessorsmicrtocontrollerspart4 8051 MicrocontrollerEi502microprocessorsmicrtocontrollerspart4 8051 Microcontroller
Ei502microprocessorsmicrtocontrollerspart4 8051 Microcontroller
 
PIC MICROCONTROLLERS -CLASS NOTES
PIC MICROCONTROLLERS -CLASS NOTESPIC MICROCONTROLLERS -CLASS NOTES
PIC MICROCONTROLLERS -CLASS NOTES
 
Microcontroller (8051) by K. Vijay Kumar
Microcontroller (8051) by K. Vijay KumarMicrocontroller (8051) by K. Vijay Kumar
Microcontroller (8051) by K. Vijay Kumar
 
Atmega324 p
Atmega324 pAtmega324 p
Atmega324 p
 
At 89s51
At 89s51At 89s51
At 89s51
 
Chapter1
Chapter1Chapter1
Chapter1
 
Ashish microcontroller 8051
Ashish microcontroller 8051Ashish microcontroller 8051
Ashish microcontroller 8051
 
Intel Knights Landing Slides
Intel Knights Landing SlidesIntel Knights Landing Slides
Intel Knights Landing Slides
 
INTRODUCTION TO MICROCONTROLLER
INTRODUCTION TO MICROCONTROLLERINTRODUCTION TO MICROCONTROLLER
INTRODUCTION TO MICROCONTROLLER
 
Microcontroller 8051 basics (part I)
Microcontroller 8051 basics (part I)Microcontroller 8051 basics (part I)
Microcontroller 8051 basics (part I)
 
Presentation on Embedded system using micro controller by PARAS JHA
Presentation on Embedded system using micro controller  by PARAS JHAPresentation on Embedded system using micro controller  by PARAS JHA
Presentation on Embedded system using micro controller by PARAS JHA
 
MICROCONTROLLER 8051- Architecture & Pin Configuration
MICROCONTROLLER 8051- Architecture & Pin Configuration MICROCONTROLLER 8051- Architecture & Pin Configuration
MICROCONTROLLER 8051- Architecture & Pin Configuration
 
Microcontroller 8051 gs
Microcontroller 8051 gsMicrocontroller 8051 gs
Microcontroller 8051 gs
 
8051 MICROCONTROLLER
8051 MICROCONTROLLER 8051 MICROCONTROLLER
8051 MICROCONTROLLER
 
8051 microcontroller
8051 microcontroller 8051 microcontroller
8051 microcontroller
 
PIC_ARM_AVR
PIC_ARM_AVRPIC_ARM_AVR
PIC_ARM_AVR
 
Atmega8u2 mur
Atmega8u2 murAtmega8u2 mur
Atmega8u2 mur
 
Microcontroller
MicrocontrollerMicrocontroller
Microcontroller
 

Viewers also liked

Ohrm competencybasedinterviews-eng-may11-110627085307-phpapp02
Ohrm competencybasedinterviews-eng-may11-110627085307-phpapp02Ohrm competencybasedinterviews-eng-may11-110627085307-phpapp02
Ohrm competencybasedinterviews-eng-may11-110627085307-phpapp02Richard Hong
 
Wordpress website development for Furniture designer & manufacturer in UK
Wordpress website development for Furniture designer & manufacturer in UKWordpress website development for Furniture designer & manufacturer in UK
Wordpress website development for Furniture designer & manufacturer in UKInfigic Digital Solutions
 
Champagne Taittinger Does Weddings
Champagne Taittinger Does WeddingsChampagne Taittinger Does Weddings
Champagne Taittinger Does WeddingsChantelle Pabros
 
Top 15 Tourist Destinations In Jaipur
Top 15 Tourist Destinations In JaipurTop 15 Tourist Destinations In Jaipur
Top 15 Tourist Destinations In JaipurTaxi Service Kharar
 
Parcial gerson armando martinez estrada.
Parcial gerson armando martinez estrada.Parcial gerson armando martinez estrada.
Parcial gerson armando martinez estrada.gersonmartinez33
 
Lesson 3 childhood
Lesson 3 childhoodLesson 3 childhood
Lesson 3 childhoodMorgan91
 
ECRE Afghan guidelines_2004
ECRE Afghan guidelines_2004ECRE Afghan guidelines_2004
ECRE Afghan guidelines_2004Alina Potts, MPH
 

Viewers also liked (13)

Above Board
Above BoardAbove Board
Above Board
 
Hecho punible,caracteresdel del delito
Hecho punible,caracteresdel del delitoHecho punible,caracteresdel del delito
Hecho punible,caracteresdel del delito
 
Ohrm competencybasedinterviews-eng-may11-110627085307-phpapp02
Ohrm competencybasedinterviews-eng-may11-110627085307-phpapp02Ohrm competencybasedinterviews-eng-may11-110627085307-phpapp02
Ohrm competencybasedinterviews-eng-may11-110627085307-phpapp02
 
Wordpress website development for Furniture designer & manufacturer in UK
Wordpress website development for Furniture designer & manufacturer in UKWordpress website development for Furniture designer & manufacturer in UK
Wordpress website development for Furniture designer & manufacturer in UK
 
Champagne Taittinger Does Weddings
Champagne Taittinger Does WeddingsChampagne Taittinger Does Weddings
Champagne Taittinger Does Weddings
 
Ecommerce website development uk
Ecommerce website development ukEcommerce website development uk
Ecommerce website development uk
 
Top 15 Tourist Destinations In Jaipur
Top 15 Tourist Destinations In JaipurTop 15 Tourist Destinations In Jaipur
Top 15 Tourist Destinations In Jaipur
 
Parcial gerson armando martinez estrada.
Parcial gerson armando martinez estrada.Parcial gerson armando martinez estrada.
Parcial gerson armando martinez estrada.
 
Resume Austine
Resume AustineResume Austine
Resume Austine
 
Lesson 3 childhood
Lesson 3 childhoodLesson 3 childhood
Lesson 3 childhood
 
sustancias anti-nutricionales en no rumiantes
sustancias anti-nutricionales en no rumiantes sustancias anti-nutricionales en no rumiantes
sustancias anti-nutricionales en no rumiantes
 
ECRE Afghan guidelines_2004
ECRE Afghan guidelines_2004ECRE Afghan guidelines_2004
ECRE Afghan guidelines_2004
 
Hardware
HardwareHardware
Hardware
 

Similar to 04_mccormick

Best Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing ClustersBest Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing ClustersIntel® Software
 
Ashutosh kumar ( JAMIA HAMDARD )
Ashutosh kumar ( JAMIA HAMDARD )Ashutosh kumar ( JAMIA HAMDARD )
Ashutosh kumar ( JAMIA HAMDARD )Ashutosh Kumar
 
11090510438953
1109051043895311090510438953
11090510438953Larry Boss
 
Overview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit MicrocontrollersOverview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit MicrocontrollersPremier Farnell
 
My Feb 2003 HPCA9 Keynote Slides - Billion Transistor Processor Chips
My Feb 2003  HPCA9 Keynote Slides - Billion Transistor Processor ChipsMy Feb 2003  HPCA9 Keynote Slides - Billion Transistor Processor Chips
My Feb 2003 HPCA9 Keynote Slides - Billion Transistor Processor ChipsDileep Bhandarkar
 
Intel new processors
Intel new processorsIntel new processors
Intel new processorszaid_b
 
Intel Distribution for Python - Scaling for HPC and Big Data
Intel Distribution for Python - Scaling for HPC and Big DataIntel Distribution for Python - Scaling for HPC and Big Data
Intel Distribution for Python - Scaling for HPC and Big DataDESMOND YUEN
 
Optimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for EnergyOptimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for EnergyDavid Lecomber
 
Embedded training report(mcs 51)
Embedded training report(mcs 51)Embedded training report(mcs 51)
Embedded training report(mcs 51)Gurwinder Singh
 
Introduction2_PIC.ppt
Introduction2_PIC.pptIntroduction2_PIC.ppt
Introduction2_PIC.pptAakashRawat35
 
DESIGN CHOICES FOR EMBEDDED REAL-TIME CONTROL SYSTEMS @ 4th FPGA Camp
DESIGN CHOICES FOR EMBEDDED REAL-TIME CONTROL SYSTEMS @ 4th FPGA CampDESIGN CHOICES FOR EMBEDDED REAL-TIME CONTROL SYSTEMS @ 4th FPGA Camp
DESIGN CHOICES FOR EMBEDDED REAL-TIME CONTROL SYSTEMS @ 4th FPGA CampFPGA Central
 
Introduction to Parallel Distributed Computer Systems
Introduction to Parallel Distributed Computer SystemsIntroduction to Parallel Distributed Computer Systems
Introduction to Parallel Distributed Computer SystemsMrMaKKaWi
 
Optimizing Python
Optimizing PythonOptimizing Python
Optimizing PythonAdimianBE
 

Similar to 04_mccormick (20)

Intel Pentium Pro
Intel Pentium ProIntel Pentium Pro
Intel Pentium Pro
 
Best Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing ClustersBest Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing Clusters
 
QSpiders - Basic intel architecture
QSpiders - Basic intel architectureQSpiders - Basic intel architecture
QSpiders - Basic intel architecture
 
Ashutosh kumar ( JAMIA HAMDARD )
Ashutosh kumar ( JAMIA HAMDARD )Ashutosh kumar ( JAMIA HAMDARD )
Ashutosh kumar ( JAMIA HAMDARD )
 
11090510438953
1109051043895311090510438953
11090510438953
 
ATMEGA-169P.pdf
ATMEGA-169P.pdfATMEGA-169P.pdf
ATMEGA-169P.pdf
 
At89s51
At89s51At89s51
At89s51
 
Overview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit MicrocontrollersOverview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit Microcontrollers
 
My Feb 2003 HPCA9 Keynote Slides - Billion Transistor Processor Chips
My Feb 2003  HPCA9 Keynote Slides - Billion Transistor Processor ChipsMy Feb 2003  HPCA9 Keynote Slides - Billion Transistor Processor Chips
My Feb 2003 HPCA9 Keynote Slides - Billion Transistor Processor Chips
 
Processor
ProcessorProcessor
Processor
 
Intel new processors
Intel new processorsIntel new processors
Intel new processors
 
Intel Distribution for Python - Scaling for HPC and Big Data
Intel Distribution for Python - Scaling for HPC and Big DataIntel Distribution for Python - Scaling for HPC and Big Data
Intel Distribution for Python - Scaling for HPC and Big Data
 
Optimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for EnergyOptimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for Energy
 
Embedded training report(mcs 51)
Embedded training report(mcs 51)Embedded training report(mcs 51)
Embedded training report(mcs 51)
 
8051 microcontroller
8051 microcontroller8051 microcontroller
8051 microcontroller
 
Introduction2_PIC.ppt
Introduction2_PIC.pptIntroduction2_PIC.ppt
Introduction2_PIC.ppt
 
DESIGN CHOICES FOR EMBEDDED REAL-TIME CONTROL SYSTEMS @ 4th FPGA Camp
DESIGN CHOICES FOR EMBEDDED REAL-TIME CONTROL SYSTEMS @ 4th FPGA CampDESIGN CHOICES FOR EMBEDDED REAL-TIME CONTROL SYSTEMS @ 4th FPGA Camp
DESIGN CHOICES FOR EMBEDDED REAL-TIME CONTROL SYSTEMS @ 4th FPGA Camp
 
Introduction to Parallel Distributed Computer Systems
Introduction to Parallel Distributed Computer SystemsIntroduction to Parallel Distributed Computer Systems
Introduction to Parallel Distributed Computer Systems
 
IBM Flex System x440 Compute Node
IBM Flex System x440 Compute NodeIBM Flex System x440 Compute Node
IBM Flex System x440 Compute Node
 
Optimizing Python
Optimizing PythonOptimizing Python
Optimizing Python
 

04_mccormick

  • 1. HotChips 2002 - Intel® Itanium® 2 Processor A Brief Analysis of the SPEC CPU2000 Benchmarks on the Intel® Itanium® 2 Processor James McCormick, HP Allan Knies, Intel All trademarks are the property of their respective owners
  • 2. HotChips 2002 - Intel® Itanium® 2 Processor Agenda This is a data-oriented presentation, not research Agenda – Brief performance summary – Comparison of how HP and Intel compilers use Itanium® architecture features and compare to best RISC. – Analysis of microarchitectural features of the Intel® Itanium® 2 processor and how it affects SPEC CPU2000 performance
  • 3. HotChips 2002 - Intel® Itanium® 2 Processor Overall Performance SPEC{int/fp}_base2000 810/1356 (best 0.18u) Linpack 1000: 3.5 Gflops (best overall) TPC-C (SQL/4P): 78K tpmC (best 4P number) SPECweb_SSL: 1520 connections (best of class) ¾ Itanium® 2 processor best of class on a wide range of applications SPEC CPU2000 Results 0 200 400 600 800 1000 1200 1400 1600 SPECint_base2000 SPECint_peak2000 SPECfp_base2000 SPECfp_peak2000 results from www.spec.org Itanium2/HP 1Ghz IBM Power4 1.3 Ghz Alpha 21264 1Ghz UltraSPARC IIICu 1.05 Ghz
  • 4. HotChips 2002 - Intel® Itanium® 2 Processor Part I: ISA and Compiler Comparisons Allan Knies Itanium® Architecture and Performance Team Intel Corporation allan.knies@intel.com
  • 5. HotChips 2002 - Intel® Itanium® 2 Processor • Number of ‘useful instructions’ (shown in blue) +/- 1% of Alpha • Total instructions (blue+green) +20-30% of Alpha (due to NOPs) • Bundling is main cause of extra NOPS ¾ We’ll see that extra instr/nops are not substantially impacting perf Instruction Mix: High Level 0.0E+00 5.0E+10 1.0E+11 1.5E+11 2.0E+11 2.5E+11 3.0E+11 3.5E+11 4.0E+11 Itanium2/Intel Itanium2/HP Alpha TotalInstructions Branch or Not Predicated Predicate True Predicate False Nop
  • 6. HotChips 2002 - Intel® Itanium® 2 Processor Instruction Mix Details Introduction Itanium/Alpha ISAs: • Itanium® arch has 40% fewer memory operations and 30% fewer branches than Alpha (some impact from no-pgo on Alpha) • Itanium arch has about 10% more ALU/compares/shifts than Alpha • Itanium arch has about 20-30% NOPs – eventually expect this to be under 20% HP/Intel Compiler: • HP compiler uses more memory and ALU ops than Intel • Implies HP more conservative with registers – we’ll see impact later ¾ Itanium® architecture trades more ‘easy’ instructions (NOP, alu, cmp) for reducing the ‘hard’ instructions (branch, load) ¾ More than one way to get good performance from a compiler
  • 7. HotChips 2002 - Intel® Itanium® 2 Processor • Alpha has 1.4x mem refs of Itanium® architecture (incl/RSE ops) • RSE ops are easy to optimize in the future, if needed • HP compiler is more consv with regs, but has more ALU/reloads ¾ Large register file, good compiler technology, and RSE pay off ¾ HP has lower RSE costs, but more reloads/alu ops Instruction Mix: Memory Instruction Details 0.0E+00 1.0E+10 2.0E+10 3.0E+10 4.0E+10 5.0E+10 6.0E+10 7.0E+10 8.0E+10 9.0E+10 1.0E+11 Itanium2/Intel Itanium2/HP Alpha MemoryInstructions RSE Stores Loads
  • 8. HotChips 2002 - Intel® Itanium® 2 Processor • Total time spent in RSE is only 3%-4% of overall execution • 1½ -3 cycles per call/return for RSE spill/fill activity • 1-3 instructions per subroutine setup for RSE • Intel compiler has fewer calls, but more cycles/call ¾ Register stack provides very low overhead call/return support Features: Register Stack 0.0E +00 2.0E +09 4.0E +09 6.0E +09 8.0E +09 1.0E +10 Intel HP RS E TotalRSECycles Total RS E Cycles
  • 9. HotChips 2002 - Intel® Itanium® 2 Processor • HP compiler generates 13% fewer br’s than Intel, 9% more mispredictions • HP compiler generates 31% fewer branches than Alpha • Itanium-based binaries fewer tk br’s than Alpha, data skewed by lack of PGO ¾ Itanium® architecture reduces the # branches and branch mispredicts ¾ HP/Intel compilers both reduce # branches – but with different focus Instruction Mix: Branch Detail 1.389E+10 1.282E+10 2.42E+10 1.895E+10 1.624E+10 1.383E+10 0.0E+00 5.0E+09 1.0E+10 1.5E+10 2.0E+10 2.5E+10 3.0E+10 3.5E+10 4.0E+10 Itanium2/Intel Itanium2/HP Alpha BranchInstructions Not Taken Branches Taken Branches
  • 10. HotChips 2002 - Intel® Itanium® 2 Processor • Useful instructions per taken branch is very high • Useful instructions per mispredicted branch slightly better for Intel • Useful instructions/call very high – Intel/HP compilers very aggr inlining ¾ Advanced compilers reduce stress on br prediction/Icache HW ¾ Trading Istream size for regularity improves HW efficiency Parallelism: Region Size 19 159 212 20 141 158 0 50 100 150 200 250 U s eful Ins tr/Tk Br U s eful Ins t/MP Branch U s eful Ins t/C all UsefulInstructions Intel HP
  • 11. HotChips 2002 - Intel® Itanium® 2 Processor • About 20-30% of loads are speculative in Intel binaries • Data shows tiny penalty for chk.s usage despite high usage rate • Intel has 10x more chk.s than HP, HP uses ‘no recovery model’ selectively (per benchmark decision) • HP has 10x more chk.a/ld.c then Intel, recovery less than 1% time ¾ Speculation heavily used, but causes little overhead Features: Speculation 1.0E+00 1.0E+02 1.0E+04 1.0E+06 1.0E+08 1.0E+10 Intel HP Intel HP chk.s chk.a and ld.c Instructions # Checks # Failed
  • 12. HotChips 2002 - Intel® Itanium® 2 Processor • Useful IPC computed using ‘unstalled IPC’ • Compilers find 2.5-3.0 IPC in integer apps (even beyond SPEC) • Dynamic delays reduce this to 1.3 achieved for CPU2000 integer ¾ Differences in perf/heuristics shows headroom for both compilers ¾ Good IPC found by both compilers, room for uArch improvements Useful IPC 0.00 0.50 1.00 1.50 2.00 2.50 3.00 3.50 4.00 Intel HP Intel HP Intel HP Intel HP Intel HP Intel HP Intel HP Intel HP Intel HP Intel HP Intel HP Intel HP Intel HP gzip vpr gcc mcf crafty parser eon perl gap vortex bzip2 twolf A vg UsefulIPC A chieved Found
  • 13. HotChips 2002 - Intel® Itanium® 2 Processor Notes • Itanium-based binaries used for these stats are older than those used for the official SPEC submission (less than 10% difference) • The results for Intel® Itanium® 2 processor in Part I are: one with the Intel compiler running on 64-bit MS OS and another with the HP compiler running under HP-UX • Alpha ISA numbers via simulation, binaries used near peak (no profile guided optimization), tuned for 21264. Alpha data missing VPR and PERL – thus, left out of all averages in Part I. • Results computed with arithmetic averages – data thus skewed towards long-running benchmarks
  • 14. HotChips 2002 - Intel® Itanium® 2 Processor Part II: Microarchitecture James McCormick HP Colorado VLSI Laboratory
  • 15. HotChips 2002 - Intel® Itanium® 2 Processor Cycle Accounting: INT 7% 38% 3% 1% 49% 2% PIPE FLUSH DATA ACCESS INSTR ACCESS SCOREBOARD RSE UNSTALLED Cycle Accounting: FP 1% 47% 1% 0% 50% 1% PIPE FLUSH DATA ACCESS INSTR ACCESS SCOREBOARD RSE UNSTALLED ¾ Remaining: unstalled execution and data access ¾ Great performance
  • 16. HotChips 2002 - Intel® Itanium® 2 Processor Instructions Per Unstalled Cycle 0 1 2 3 4 5 6 Integer Floating Point IPC USEFUL w/ PRED OFF + NOPS ¾ High machine parallelism
  • 17. HotChips 2002 - Intel® Itanium® 2 Processor Total Instruction Fetch Latency 0 20 40 60 80 100 164.gzip 175.vpr 176.gcc 181.mcf 186.crafty 197.parser 252.eon 253.perlbmk 254.gap 255.vortex 256.bzip2 300.twolf AVERAGE 168.wupwise 171.swim 172.mgrid 173.applu 177.mesa 178.galgel 179.art 183.equake 187.facerec 188.ammp 189.lucas 191.fma3d 200.sixtrack 301.apsi AVERAGE Integer Floating Point Percent L1I HIT L2 INSTR HIT L3 INSTR HIT MEM INSTR ACCESS ¾ Noticeable L1I misses ¾ Very small I-fetch component
  • 18. HotChips 2002 - Intel® Itanium® 2 Processor ¾ High FE throughput rate ¾ I-miss latency hidden Instruction Fetch Ahead 0 1 2 3 4 5 6 INT AVERAGE FP AVERAGE InstructionsPerCycle 0 10 20 30 40 50 60 70 80 90 100 PercentExposed DECOUPLED DISP THROUGHPUT DECOUPLED FE THROUGHPUT FE BUBBLES EXPOSED
  • 19. HotChips 2002 - Intel® Itanium® 2 Processor Branch Mispredictions 37.7 0 2 4 6 8 10 12 14 16 18 20 164.gzip 175.vpr 176.gcc 181.mcf 186.crafty 197.parser 252.eon 253.perlbmk 254.gap 255.vortex 256.bzip2 300.twolf AVERAGE Integer Misprediction Percentage 0 2 4 6 8 10 12 14 16 18 20 FullPipelinesPer Flush WRONG PAT H WRONG T ARGET UNKNOWN FULL PIPELINES PER M ISPRED BR ¾ High accuracy, low penalty ¾ Helps instruction fetch
  • 20. HotChips 2002 - Intel® Itanium® 2 Processor Total Data Read Latency 0 10 20 30 40 50 60 70 80 90 100 164.gzip 175.vpr 176.gcc 181.mcf 186.crafty 197.parser 252.eon 253.perlbmk 254.gap 255.vortex 256.bzip2 300.twolf AVERAGE 168.wupwise 171.swim 172.mgrid 173.applu 177.mesa 178.galgel 179.art 183.equake 187.facerec 188.ammp 189.lucas 191.fma3d 200.sixtrack 301.apsi AVERAGE Integer Floating Point PercentofLatency L1D DATA READ L2 DATA READ L3 DATA READ MEM DATA READ ¾ Large component, large opportunity
  • 21. HotChips 2002 - Intel® Itanium® 2 Processor Summary Itanium® 2 Processor Delivers Leadership Performance – Architecture / Compilers • Expose lots of ILP to in-order pipeline • Replace difficult instructions with easy ones • RSE and large register file work well together – CPU Design • Machine parallelism handles high ILP • Aggressive design reduces bottlenecks
  • 22. HotChips 2002 - Intel® Itanium® 2 Processor Acknowledgements Jason Cantin (U. Wisconsin): Ran all the experiments to gather the Alpha ISA data with changes/alterations at our request. Without Jason, we would have no Alpha data of any kind. Thanks to his efforts to rerun all the data for us! Hwansoo Han, Youngsoo Choi, Geetha Vederaman (Intel): Intel data gathering, formatting, methodology, performance monitoring Bryan Black, Ed Grochowski, Jim Callister, Carole Dulong (Intel): Extensive draft review and comments Caliper Team (HP): Tool support
  • 23. HotChips 2002 - Intel® Itanium® 2 Processor Backup/Reference Slides
  • 24. HotChips 2002 - Intel® Itanium® 2 Processor Configuration Information Intel® Itanium® 2 processor data for Intel systems/compilers: – Binaries: Pre-production version of Intel C++ 6.0 Compiler, -O3 with interprocedural optimization and profile guided optimization – Run on: Pre-production stepping of Itanium 2 processor 800Mhz/200Mhz core/bus, Intel 870 chipset, monitor kernel and user instructions and events Intel® Itanium® 2 processor data for HP systems/compilers: – Binaries: Pre-production version of HP compiler – Run on: Pre-production stepping of Itanium 2 processor 1000Mhz/200Mhz core/bus, rx2600 prototype, monitor kernel and user instructions and events Alpha ISA data: – Run on: Functional simulator, system code not simulated – Simulation details: http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/ – All data thanks to Jason Cantin at the University of Wisconsin. – Binaries: http://www.eecs.umich.edu/~chriswea/benchmarks/Com.cf In general, these binaries are optimized for 21264, peak optimization. Usually, –g3 –fast –O4, but NO profile feedback. Compiler: DECCV5.9-005 and DIGIT AL C++ V6.1-027 Other remarks: – All averages in slides left out PERL and VPR due to data not available for Alpha