01 intel processor architecture core
Presentation Transcript

  • Intel® Core™ Microarchitecture Intel® Software College
  • Objectives
    After completion of this module you will be able to describe:
    • Components of an IA processor
    • Working flow of the instruction pipeline
    • Notable features of the architecture
    Intel® Processor Micro-architecture - Core® microarchitecture. Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
  • Agenda
    • Introduction
    • Knowledge preparation
    • Notable features
    • Micro-architecture tour
    • Coding considerations
  • Industrial Recognition
    • PC Format, May 2006: “Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon64s into burger meat is the game..”
    • Intel's Next Generation Microarchitecture Unveiled, Real World Tech: “Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry.”
    • Intel Dishes the Knockout Punch to AMD with Conroe, GD Hardware.com: “…the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session”
    • Intel Regains Performance Crown, Anandtech: “… At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance than what we’ve seen here.”
    • Intel Reveals Conroe Architecture, Extremetech: “… And not only was the Intel system running at 2.66GHz - a slower clock rate than the top Pentium 4 - it was outpacing an overclocked Athlon 64 FX-60. Wrap your brain around that idea for a bit…”
    • Conroe Benchmarks - Intel Showing Big Strength, Hot Hardware.com: “… Intel is poised to change the face of the desktop computing landscape…”
  • Performance Summary
    Intel® Core™ Microarchitecture dramatically boosts Intel platform performance
    • Conroe & Woodcrest drive clear Desktop/Server performance leadership
    • Merom extends Intel Mobile performance leadership
    Intel® Core™ Microarchitecture-based platforms set the bar in Performance and Energy Efficiency for the Multi-Core era
    • Intel’s 3rd generation dual-core (while competition stuck on 1st generation)
    • New Intel high-performance ‘engine’: Wider, Smarter, Faster, More Efficient
    Best Processor on the Planet: Energy-Efficient Performance
    • Performance Boosts (1): 20% (Merom), 40% (Conroe), 80% (Woodcrest)
    • The “Core™ Effect”: Intel® Core™ Microarchitecture ramp fuels broad roadmap accelerations
    (1) Based on SPECint*_rate_base2000
  • Agenda
    • Introduction
    • Knowledge preparation
      • Architecture vs. Microarchitecture
      • CISC vs. RISC
      • Performance Measurements
      • Pipeline Design
      • Power and Energy
      • Chip Multi-Processing
    • Notable features
    • Micro-architecture tour
    • Coding considerations
  • Architecture and Micro-architecture
    What is Computer Architecture?
    • Architecture is the set of features which are externally visible:
      • Instruction set
      • Registers
      • Addressing modes
      • Bus protocols
    Intel Architectures (IA)
    • IA32/X86 (8-bit, 16-bit and 32-bit Integer architecture)
      • X87 (Floating Point extension)
      • MMX (Multi-Media extension)
      • SSE, SSE2, SSE3 (Streaming SIMD Extensions)
    • Intel® 64/EM64T (64-bit Integer extension of IA32)
    • IA64 (Intel new 64-bit architecture)
      • Itanium/Itanium 2 processor family
  • Architecture and Micro-architecture (cont.)
    What is Micro-architecture?
    • Same as µ-Architecture or u-Architecture
    • “Invisible” features that provide meaningful value to the end user (whatever makes you buy a new compatible PC)
      • Programs run faster: Improved Performance
      • Reduced Power consumption: Extended Battery life
      • H/W fits into Smaller Form Factor
  • Intel® Architecture History
    • Architecture: Instruction set definition and compatibility
      Examples: EPIC* (Itanium®), IA-32, IXA* (XScale)
    • Microarchitecture: Hardware implementation maintaining instruction set compatibility with high-level architecture
      Examples: P5, P6, Intel NetBurst®, Banias
    • Processors: Productized implementation of Microarchitecture
      Examples: Pentium®, Pentium® Pro, Pentium® II/III, Pentium® 4, Pentium® D, Pentium® M, Xeon®
    * IXA: Intel Internet Exchange Architecture; EPIC: Explicitly Parallel Instruction Computing
  • Intel® Core™ Microarchitecture Processors
    Intel® NetBurst® + Mobile Microarchitecture + New Innovations
    Intel® Core™ 2 Duo/Quad/Extreme processors
  • RISC Approach to CPU design (RISC = Reduced Instruction Set Computers)
    Optimize H/W for common basic operations
    • Fixed instruction length
      • Shorter Execution Pipeline
      • Ease of Instruction Level Parallelism
    • Large number of registers
      • Fewer memory accesses
    • ‘Load/Store’ architecture
      • Shorter Execution Pipeline
      • Ease of advancing Loads
    • Branch Hints
      • Reduce pipeline flush events
    • ‘Exotic’ stuff to be implemented in S/W with minimal H/W support
      • No ‘complex’ H/W instructions
      • Handle exceptional conditions in S/W
    Examples: MIPS, IBM Power and PowerPC, Sun SPARC
    Achieve maximum performance by right partitioning between H/W and S/W
  • CISC Approach to CPU design (CISC = Complex Instruction Set Computers)
    Rich architecture
    • Variable length instructions
    • Complex addressing modes
    On-chip H/W - S/W partitioning required
    • H/W keeps executing ‘simple’ stuff
    • Complex instructions are ‘emulated’ using u-code routines from ROM
    • More instructions treated as ‘simple’ as more H/W is available
    COMPATIBILITY has some major advantages:
    • Large (and forever increasing) software base
    • Code development tools
    • Expertise
    • H/W - S/W spiral
    Examples: Intel IA32, Motorola 680X0
    Maximize information passed to the H/W
  • Performance Measurement
    Performance is the reciprocal of the “Time of execution”:
    $\text{Performance} \approx \dfrac{1}{\text{Time of Execution}} = \dfrac{1}{L \cdot CPI \cdot T_c}$
    Where:
    • $L$ = Code Length (# of machine instructions)
    • $CPI$ = Clock cycles Per Instruction
    • $T_c$ = Clock period (nSecs)
    Substitute:
    • $IPC$ = Instructions Per Cycle = $1/CPI$
    • $F$ = Frequency = $1/T_c$
    $\text{Performance} \approx \dfrac{IPC \cdot F}{L}$
    • Raise $IPC$: Improve ILP
    • Raise $F$: Improve Timing
    • Reduce $L$: Arch Enhancements
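The relation on this slide can be checked numerically. The following is a minimal sketch in Python; the function names and the sample values (instruction count, IPC, frequency) are illustrative assumptions, not figures from the deck.

```python
import math

def execution_time_ns(code_length, cpi, clock_period_ns):
    """Time of execution = L * CPI * Tc, in nanoseconds."""
    return code_length * cpi * clock_period_ns

def performance(code_length, ipc, frequency_ghz):
    """Performance ~ IPC * F / L. With F in GHz this equals 1 / (time in ns)."""
    return ipc * frequency_ghz / code_length

# Hypothetical workload of one million instructions.
L = 1_000_000

# A narrow core at a high clock and a wider core at a lower clock
# reach the same performance when IPC * F is equal.
narrow_fast = performance(L, ipc=1.0, frequency_ghz=3.0)
wide_slow = performance(L, ipc=1.5, frequency_ghz=2.0)
print(math.isclose(narrow_fast, wide_slow))
```

The comparison illustrates why the deck treats IPC and frequency as separate levers: either one can be traded for the other as long as their product is preserved.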
  • Performance Measurement (cont.)
    Performance considerations:
    • Which Code/Application to run?
    • Which OS?
    • Which other components in the platform?
    • Under which thermal conditions?
    • Multithreading? Multiprocessing?
    Benchmarks examples
    • Industry Standard
      • SPEC (ISPEC, FSPEC)
      • TPC
    • Commercial
      • SysMark
      • MobileMark
      • PCMark
      • Sandra
      • ScienceMark
    • Applications
      • Video (Windows Media encoder, DivX)
      • Audio (Lame MP3)
      • Compression (RAR)
      • Content creation (3DSM, Photoshop, Premiere)
      • Latest Games (Doom III, FarCry, but changes fast)
    • Specific industries use specific benchmarks
      • Linux compilation, POVRay, LinPack, lmbench
  • Design Considerations for Different Market Segments
    Constraints:
    • Desktop: thermally and area constrained
    • Extreme: unconstrained
    • Value: very area constrained
    • Mobile: thermally, energy and area constrained
    • Servers: thermally and energy constrained
    Micro-architecture is the Art of Tradeoffs between:
    • Schedule
    • Requirements / Standards
    • Performance
    • Features
    • Power / Energy
    • Area / Cost
  • Design Metrics
    IPC = Instructions per Cycle
    • The more the better
    Latency (same as Response Time)
    • The time interval between
      • when any request for data is made and
      • when the data transfer completes
    • The less the better
    Throughput
    • The amount of work completed by the system per unit of time
    • The more the better
    • ops/sec
  • CPU Pipeline
    Break the work into smaller pieces
    • Four basic stages of instruction life
      • Fetch - bring instruction to core
      • Decode - read operands from register
      • Execute - perform the operation
      • Writeback - save result to register
    • Execution timing of simple instructions (legend: “op src1,src2 dst”)
      add eax, ebx  eax   F D E W
      sub ecx, edx  ecx   F D E W
    Increased throughput
    • increased number of completed instructions per cycle
  • Pipeline Design - Explore Parallelism
    A new instruction does not always depend on the previous one
    • Can start a new instruction before the previous one is finished
    • ...if different stages use different H/W resources
    Run instructions in parallel (pipeline)
      add eax, ebx  eax   F D E W
      sub ecx, edx  ecx     F D E W
      or  edi, esi  edi       F D E W
    Need to balance pipe stages
    • Each stage should take the same time for best throughput and utilization
    Clock cycle is determined by the longest path!
    [Diagram: four overlapping Fetch / Decode / Exec / WB sequences, each starting one stage later than the previous one]
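The overlap described on these two pipeline slides can be sketched in a few lines of Python. This is an idealized model: it assumes every stage takes exactly one cycle and that no instruction ever stalls, which the following slide on stalls shows is not generally true. The function names are invented for illustration.

```python
STAGES = ["F", "D", "E", "W"]  # Fetch, Decode, Execute, Writeback

def total_cycles(num_instructions, pipelined=True):
    """Cycles needed to finish all instructions in an ideal, stall-free model."""
    if num_instructions == 0:
        return 0
    if pipelined:
        # The first instruction fills the pipe, then one completes per cycle.
        return len(STAGES) + num_instructions - 1
    # Without pipelining each instruction runs all stages before the next starts.
    return len(STAGES) * num_instructions

def render(instructions, pipelined=True):
    """Draw a text timing diagram, one row per instruction, two columns per cycle."""
    lines = []
    for index, text in enumerate(instructions):
        start_cycle = index if pipelined else index * len(STAGES)
        lines.append(f"{text:<14}" + "  " * start_cycle + " ".join(STAGES))
    return "\n".join(lines)

program = ["add eax, ebx", "sub ecx, edx", "or  edi, esi"]
print(render(program))
print(total_cycles(len(program)), "cycles pipelined,",
      total_cycles(len(program), pipelined=False), "cycles unpipelined")
```

For the three-instruction example the model gives 6 cycles with pipelining against 12 without, which is the throughput gain the slides refer to.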
  • Pipeline Design – Fighting Stalls
    Data flow dependency (instructions output/input)
    • Solved by bypasses, renaming etc.
    Control flow dependencies
    • Solved by branch prediction
    Others (cache misses, long latency instructions)
    • Solved by other dynamic scheduling techniques
  • Race of CISC vs. RISC
    In modern CPUs, advanced µ-Architecture techniques minimize the advantages of RISC over CISC
    • Branch Prediction
      • Reduces the effect of extra pipeline stages
    • Register Renaming
      • Effectively increases the number of registers
    • Out Of Order
      • Reduces the number of stalls caused by shortage of registers
    • Speculative Execution
      • Further reduces the number of stalls
    • Power saving features
      • Reduce the overhead when not needed
  • µop – Intel’s Take on the CISC/RISC Race
    (CISC) instructions are translated into one or more (RISC) uops (micro-operations)
    • Fixed format
    • Wide and simple
    • Temp registers
    Usually one uop per instruction
    A complex instruction can be thousands of uops
    Stores are divided into two uops (STA and STD)
    Fusion plays games here
  • Power and Energy
    Maximum power (TDP):
    • Cooling requirements
    • Cooling solution
    • Computer form factor and acoustic noise
    Average power
    • Battery life
    • Electricity bill
    General calculation:
    • P = frequency * voltage^2 * activity factor * capacitance + leakage
    Reducing TDP
    • Fewer transistors and wires
    • Smaller transistors and wires
    • Power features: less activity
    • Low leakage transistors
    Reducing average power
    • Energy efficiency
    • Power states
    • Lower leakage
  • Dual/Multi Core and SMT
    Put more than one core per package
    Architectural change:
    • Software must be multi-threaded or multi-process
    • …but backward compatible with multiprocessor systems (MP)
    Several ways of implementing it
    • All of them being used
    [Diagram: alternative package layouts built from Core, LLC and I/O blocks]
    SMT: Run two (or more) threads on the same core, simultaneously
  • Intel Approach
    [Timeline figure: threads per processor, with State, Execution Units, Cache and Bus blocks shown per processor]
    • Q4 2000: Intel® Pentium®, 1 thread
    • Q2 2003: Intel® Pentium® with HT, 2 threads
    • Q2 2005: Intel® Pentium® D Processor, 2 threads
    • Q3 2006: Intel® Core 2 Duo®, 2 threads
    • Q4 2006: Intel® QX6700*, 4 threads
    • ?: 80 threads
    While single core performance has increased due to clock speed, increased cache and improved ILP, the biggest performance increases have come from thread level parallelism.
  • An “Acronym Cheat Sheet” of Parallel Computing
    CMP: Chip Multi Processor (two or more cores per package)
    • Dual Core: two cores in same package
    • Quad Core: four cores in same package
    DP: Dual Processor (two packages)
    MP: Multi Processor (four or more packages)
    SMT: Simultaneous Multi-Threading (virtual multi core: Hyper-Threading)
  • Agenda
    • Introduction
    • Knowledge preparation
    • Notable features
      • Wide Dynamic Execution
      • Smart Memory Access
      • Advanced Smart Cache
      • Advanced Digital Media Boost
      • Intelligent Power Capability
    • Micro-architecture tour
    • Coding considerations
  • Intel® Core® Micro-architecture Notable Features
    Intel® Wide Dynamic Execution
    • 14-stage efficient pipeline
    • Wider execution path
    • Advanced branch prediction
    • Macro-fusion
      • Roughly 15% of all instructions are conditional branches
      • Macro-fusion fuses a comparison and jump to reduce micro-ops running down the pipeline
    • Micro-fusion
      • Merges the load and operation micro-ops into one micro-op
    • 64-Bit Support
      • Merom, Conroe, and Woodcrest support EM64T
    [Block diagram: Instruction Fetch and PreDecode; Instruction Queue; Decode with uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; execution units (3x ALU, Branch, FAdd, FMul, 3x MMX/SSE, 3x FPmove, Load, Store); L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB up to 10.4 Gb/s. Path widths of 5, 4 and 4 are marked on the front-end and retirement paths.]
  • Intel® Core® Micro-architecture Notable Features (cont.)
    Intel® Smart Memory Access
    • Improved prefetching
    • Memory disambiguation
      • Advance load before a possible data dependency (pointer conflict)
      • Earlier loads hide memory latencies
  • Intel® Core® Micro-architecture Notable Features (cont.)
    Intel® Advanced Smart Cache
    • Multi-core optimization
      • Shared between the two cores
      • Advanced Transfer Cache architecture
      • Reduced bus traffic
      • Both cores have full access to the entire cache
    • Dynamic Cache sizing
  • Intel® Core® Micro-architecture Notable Features (cont.)
    Advantages of Shared Cache
    [Diagram: CPU1 and CPU2 with separate L2 caches, connected to Memory over the Front Side Bus (FSB). Shipping an L2 cache line between them costs roughly half an access to memory.]
  • Intel® Core® Micro-architecture Notable Features (cont.)
    Advantages of Shared Cache (cont.)
    [Diagram: CPU1 and CPU2 sharing one L2 cache, connected to Memory over the Front Side Bus (FSB). L2 is shared: no need to ship the cache line.]
  • Intel® Core® Micro-architecture Notable Features (cont.)
    Intel® Advanced Digital Media Boost
    • Single Cycle SIMD Operation
      • 8 Single Precision Flops/cycle
      • 4 Double Precision Flops/cycle
    • Wide Operations
      • 128-bit packed Add
      • 128-bit packed Multiply
      • 128-bit packed Load
      • 128-bit packed Store
    • Support for Intel® EM64T instructions
    [Figure: SIMD Operation (SSE/SSE2/SSE3/SSSE3). SOURCE (bits 127..0) holds X4 X3 X2 X1, the SSE/2/3 OP combines it with Y4 Y3 Y2 Y1 into DEST. Core™ µarch: clock cycle 1 produces X4opY4 X3opY3 X2opY2 X1opY1. Previous: clock cycle 1 produces X2opY2 X1opY1, clock cycle 2 produces X4opY4 X3opY3.]
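The flops-per-cycle figures on this slide follow from the register width. The sketch below assumes the peak is reached by issuing one 128-bit packed add and one 128-bit packed multiply in the same cycle (the block diagram earlier in the deck shows separate FAdd and FMul units); the function name and parameters are illustrative.

```python
def peak_flops_per_cycle(register_bits=128, element_bits=32,
                         packed_ops_per_cycle=2):
    """Lanes per register times the number of packed FP operations per cycle.

    packed_ops_per_cycle=2 is an assumption: one packed add plus one
    packed multiply issued together.
    """
    lanes = register_bits // element_bits
    return lanes * packed_ops_per_cycle

print(peak_flops_per_cycle(element_bits=32))  # single precision
print(peak_flops_per_cycle(element_bits=64))  # double precision
```

Under that assumption the single-precision case yields 8 and the double-precision case yields 4, matching the numbers quoted on the slide.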
  • Intel® Core® Micro-architecture Notable Features
    Intel® Advanced Digital Media Boost
    • Additional Media Instructions - Supplemental Streaming SIMD Extensions 3 (SSSE3)
      • 16 new packed integer instructions
      • Targeting video encode/decode
    • Significantly improved strings
      • REP MOVS and REP STOS
      • ~8 bytes / cycle throughput
      • mileage may vary
  • Intel® Core® Micro-architecture Notable Features
    Intel® Advanced Digital Media Boost
    • Supplemental SSE-3 (SSSE-3)
      • Horizontal Addition/Subtraction: PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
      • Packed Absolute Values: PABSB, PABSW, PABSD
      • Multiply and Add Packed Signed/Unsigned bytes: PMADDUBSW
      • Packed multiply High with Round and Scale: PMULHRSW
      • Packed Shuffle Bytes: PSHUFB
      • Packed SIGN: PSIGNB/W/D
      • Packed Align Right: PALIGNR
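To give a feel for what two of these instructions compute, here is a small Python emulation of the 128-bit forms of PSHUFB and PHADDW. It is a sketch of the instruction semantics only, not how the hardware or any intrinsic library implements them; registers are modeled as Python lists with element 0 as the least significant element, and the function names are simply the mnemonics in lower case.

```python
def pshufb(dst, mask):
    """Packed Shuffle Bytes on 16-byte registers.

    For each result byte i: zero if bit 7 of mask[i] is set,
    otherwise the byte of dst selected by the low 4 bits of mask[i].
    """
    return [0 if m & 0x80 else dst[m & 0x0F] for m in mask]

def phaddw(dst, src):
    """Packed Horizontal Add on registers of eight 16-bit words.

    Adjacent word pairs are summed with 16-bit wraparound; the sums from
    dst fill the low half of the result and the sums from src the high half.
    """
    def pair_sums(words):
        return [(words[i] + words[i + 1]) & 0xFFFF for i in range(0, 8, 2)]
    return pair_sums(dst) + pair_sums(src)

# Reverse the byte order of a register with a single shuffle.
data = list(range(16))
reverse_mask = list(range(15, -1, -1))
print(pshufb(data, reverse_mask))

# Horizontal sums of neighbouring words.
print(phaddw([1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40, 50, 60, 70, 80]))
```

Byte reversal and neighbour sums of this kind are typical building blocks in the video encode/decode kernels the deck says SSSE3 targets.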
  • Intel® Core® Micro-architecture Notable Features (cont.)
    Intelligent Power Capability
    • Advanced power gating & Dynamic power coordination
      • Multi-point demand-based switching
      • Voltage-Frequency switching separation
      • Supports transitions to deeper sleep modes
      • Event blocking
      • Clock partitioning and recovery
      • Dynamic Bus Parking
      • During periods of high performance execution, many parts of the chip core can be shut off
  • Agenda
    • Introduction
    • Knowledge preparation
    • Notable features
    • Micro-architecture tour
      • Front End
      • Out-Of-Order Execution Core
      • Memory Sub-system
    • Coding considerations
  • Intel® Core® Micro-architecture Drill-down
    [Block diagram of the core: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS, register alias table, ALLOC, Reservation Station, Re-Order Buffer, integer / FP / SIMD execution (3x), load, store address, store data, memory order buffer, data cache unit, page miss handler]
  • Agenda
    • Introduction
    • Knowledge refreshment
    • Notable features
    • Micro-architecture tour
      • Front End
      • Out-Of-Order Execution Core
      • Memory Sub-system
    • Coding considerations
  • Core® Micro-architecture Front End
    Prepares instructions before they are executed
    • Instruction Fetch Unit
    • Instruction Queue
    • Instruction Decode Unit
    • Branch Prediction Unit
    [Block diagram: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]
  • Intel® Core™ Microarchitecture – Front End: Instruction Queue
    Buffer between instruction pre-decode unit and decoder
    • Up to six predecoded instructions written per cycle
    • 18 instructions contained in IQ
    • Up to 5 instructions read from IQ
    Potential loop cache
    Loop Stream Detector (LSD) support
    • Re-use of decoded instructions
    • Potential power saving
  • Intel® Core™ Microarchitecture – Front End: Macro-Fusion
    • Roughly 15% of all instructions are conditional branches.
    • Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
    • Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
    • Not supported in EM64T long mode
    [Diagram: Scheduler issues the fused “cmpjae eax, [mem], label” to Execution and Branch Eval; flags and target go to Write back]
  • Intel® Core™ Microarchitecture – Front End: Macro-Fusion Absent
    • Read four instructions from Instruction Queue
    • Each instruction gets decoded into separate uops
    Enabling Example:
      for (int i=0; i<100000; i++) { … }
    Instruction Queue:
      addps xmm0, [EAX+16]
      mulps xmm0, xmm0
      movps [EAX+240], xmm0
      cmp eax, 100000
      jge label
    Decode:
      Cycle 1: addps xmm0, [EAX+16]   dec0
               mulps xmm0, xmm0       dec1
               movps [EAX+240], xmm0  dec2
               cmp eax, 100000        dec3
      Cycle 2: jge label              dec0
  • Intel® Core™ Microarchitecture – Front End: Macro-Fusion Present
    Read five instructions from the Instruction Queue; send the fusable pair to a single decoder; a single uop represents two instructions.
    Enabling example: for (unsigned int i=0; i<100000; i++) { … }
    Instruction Queue contents: addps xmm0, [EAX+16]; mulps xmm0, xmm0; movps [EAX+240], xmm0; cmp eax, 100000; jae label
    Decoder assignment:
    • Cycle 1: addps xmm0, [EAX+16] → dec0
    • Cycle 1: mulps xmm0, xmm0 → dec1
    • Cycle 1: movps [EAX+240], xmm0 → dec2
    • Cycle 1: cmpjae eax, 100000, label → dec3
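The two "Enabling Example" loops on these slides differ only in the signedness of the loop counter, which changes the conditional jump paired with the compare (jge versus jae). A minimal sketch in C of the two variants follows. The function names are hypothetical, and whether a given compiler actually emits cmp + jae for the unsigned loop is an assumption that depends on the compiler and its optimization settings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: unsigned loop counter. A compiler may test the
 * loop condition with cmp + an unsigned jump (jae/jb), the pairing the
 * slide shows as fusable. Actual code generation is an assumption. */
float sum_unsigned(const float *v, unsigned int n)
{
    assert(v != NULL || n == 0);
    float acc = 0.0f;
    for (unsigned int i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Same loop with a signed counter. A compiler may test the loop
 * condition with cmp + a signed jump (jge/jl), the pairing the
 * "Macro-Fusion Absent" slide shows decoding as two separate uops. */
float sum_signed(const float *v, int n)
{
    assert(v != NULL || n <= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += v[i];
    return acc;
}
```

Both functions compute the same result; only the generated loop test may differ, so inspecting the compiler's assembly output is the way to confirm which form was produced.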
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode / Micro-Op Fusion
    Frequent pairs of micro-operations derived from the same macro instruction can be fused into a single micro-operation.
    Micro-op fusion effectively widens the pipeline.
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode / Micro-Op Fusion (cont.)
    u-ops of a store “movps [EAX+240], xmm0”:
    • sta eax+240
    • std xmm0, [eax+240]
    [Diagram: the two u-ops shown alongside the single u-op "st xmm0, [eax+240]"]
  • Intel® Core™ Microarchitecture – Front End: Branch Prediction Improvements
    Intel® Pentium® 4 processor branch prediction PLUS the following two improvements:
    • Indirect Branch Predictor
    • Loop Detector
    Branch mispredictions reduced by >20%
  • Agenda
    Introduction
    Knowledge preparation
    Notable features
    Micro-architecture tour
    • Front End
    • Out-Of-Order Execution Core
    • Memory Sub-system
    Coding considerations
  • Core® Micro-architecture Execution Core
    Accepts decoded u-ops, assigns resources, executes and retires u-ops
    • Renamer
    • Reservation station (RS)
    • Issue ports
    • Execution Unit
    [Diagram labels: register alias table; ALLOC; Reservation Station; Re-Order Buffer; integer / FP / SIMD (3x); load; store address; store data]
  • Intel® Core™ Microarchitecture – Execution Core: Execution Core Building Blocks
    [Diagram: Renamer, RS, ROB and Execution Unit blocks above the Memory Sub-system. Port numbers as labeled: 0, 1, 5 on the integer, SIMD integer, SIMD/integer MUL and floating point units; 2 on Load; 3, 4 on Store.]
  • Intel® Core™ Microarchitecture – Execution Core: Issue Ports and Execution Units
    6 dispatch ports from RS
    • 3 execution ports (shared for integer / fp / simd)
    • load
    • store (address)
    • store (data)
    128-bit SSE implementation
    • Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)
    • Port 1 has packed add (3 cycles, all precisions)
  • Intel® Core™ Microarchitecture – Execution Core: Retirement Unit
    ReOrder Buffer (ROB)
    • Holds micro-ops in various stages of completion
    • Buffers completed micro-ops
    • Updates the architectural state in order
    • Manages ordering of exceptions
    [Diagram labels: register alias table; ALLOC; Reservation Station; Re-Order Buffer]
  • Agenda
    Introduction
    Knowledge preparation
    Notable features
    Micro-architecture tour
    • Front End
    • Out-Of-Order Execution Core
    • Memory Sub-system
    Coding considerations
  • Core® Micro-architecture Memory Sub-System
    Memory Ordering Buffer
    • Store Address Buffer
      • Stores the address of each store not actually performed
      • Loads compare their address to any store older than themselves
      • If it finds a hole…
    • Store Data Buffer
      • Stores the data of each store not actually performed
      • If a load hits in the SAB, the data is forwarded from here
    • Load Buffer
      • Stores the address of non-retired loads
      • For snoops and re-dispatch
    • One 128-bit load and one 128-bit store per cycle to different memory locations
    • Out-of-order memory operations
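The lookup described on this slide (a load comparing its address against older, not-yet-performed stores and taking its data from the Store Data Buffer on a hit) can be sketched in C. This is a simplified model under stated assumptions, not the hardware algorithm: the `pending_store` type and `forward_from_store` function are hypothetical names, every access is assumed to be the same size, and a hit requires an exact address match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One store that has not actually been performed yet (hypothetical type).
 * addr models its Store Address Buffer entry, data its Store Data Buffer entry. */
typedef struct {
    uint64_t addr;
    uint32_t data;
} pending_store;

/* stores[0] is the oldest pending store and stores[n_older - 1] is the
 * youngest store that is still older than the load. Scans from youngest
 * to oldest so the most recent matching store wins. Returns true and
 * writes *out on a hit; returns false when the load must read the cache. */
bool forward_from_store(const pending_store *stores, size_t n_older,
                        uint64_t load_addr, uint32_t *out)
{
    assert(out != NULL);
    assert(stores != NULL || n_older == 0);
    for (size_t i = n_older; i-- > 0; ) {
        if (stores[i].addr == load_addr) {
            *out = stores[i].data;
            return true;
        }
    }
    return false;
}
```

Scanning from the youngest older store backwards is the design choice that matters here: if two pending stores target the same address, the load must observe the later one.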
  • Intel® Core™ Microarchitecture – Memory Sub-system: Core® Micro-architecture Memory Sub-System (cont.)
    32 KB D-Cache (8-way, 64-byte line size)
    Shared second level (L2) instruction and data cache: 2 MB 8-way or 4 MB 16-way
    Cache to cache transfer
    • Improves producer / consumer style MP
    Wider interface to L2
    • Reduced interference
    • Processor line fill is 2 cycles
    Higher bandwidth from the L2 cache to the core
    • ~14 clock latency and 2 clock throughput
    Load & store access order
    1. L1 cache of immediate core
    2. L1 cache of the other core
    3. L2 cache
    4. Memory
    [Diagram labels: Core1; Core2; Bus; 2 MB L2 Cache]
  • Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Enhanced Data Pre-fetch Logic
    Speculates the next needed data and loads it into cache by HW and/or SW
    [Diagram, parking analogy: Door (L1 Cache); Valet Parking Area (L2 Cache); Main Parking Lot (External Memory)]
  • Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Enhanced Data Pre-fetch Logic (cont.)
    • L1D cache prefetching
      • Data Cache Unit Prefetcher
        • Known as the streaming prefetcher
        • Recognizes ascending access patterns in recently loaded data
        • Prefetches the next line into the processor's cache
      • Instruction Based Stride Prefetcher
        • Prefetches based upon a load having a regular stride
        • Can prefetch forward or backward 2 Kbytes
        • 1/2 default page size
    • L2 cache prefetching: Data Prefetch Logic (DPL)
      • Prefetches data to the 2nd level cache before the DCU requests the data
      • Maintains 2 tables for tracking loads
        • Upstream – 16 entries
        • Downstream – 4 entries
      • Every load is either found in the DPL or generates a new entry
      • Upon recognition of the 2nd load of a “stream” the DPL will prefetch the next load
  • Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Memory Disambiguation
    Memory disambiguation predictor
    • Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible
    • Increases the performance of OOO memory pipelines
    Disambiguated loads checked at retirement
    • Extension to existing coherency mechanism
    • Invisible to software and system
  • Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Memory Disambiguation Absent
    Load4 must WAIT until previous stores complete
    [Diagram: instruction order Store1 Y, Load2 Y, Store3 W, Load4 X; Memory holds Data W, Data Z, Data Y, Data X]
  • Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Memory Disambiguation Present
    Loads can decouple from stores
    Load4 can get its data WITHOUT waiting for stores
    [Diagram: instruction order Load4 X, Store1 Y, Load2 Y, Store3 W; Memory holds Data W, Data Z, Data Y, Data X]
  • Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Store Forwarding
    If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load.
    [Diagram labels: Store1 Y; Load2 Y; Internal Buffers; Memory; Data Y]
  • Advanced Memory Access / Store Forwarding: Aligned Store Cases
    [Diagram: each store size with the aligned loads drawn beneath it]
    • store 16: load 16; 2 × ld 8
    • store 32 bit: load 32 bit; 2 × load 16; 4 × ld 8
    • store 64 bit: load 64 bit; 2 × load 32 bit; 4 × load 16; 8 × ld 8
    • store 128 bit: load 128 bit; 2 × load 64 bit; 4 × load 32 bit; 8 × load 16; 16 × ld 8
  • Advanced Memory Access / Store Forwarding: Unaligned Cases
    Note that unaligned store forwarding does not occur when the load crosses a cache line boundary.
    [Diagram: each store size with the loads drawn beneath it; the legend distinguishes "Store forwarded to load" from "No forwarding"]
    • store 16: load 16‡; 2 × ld 8
    • store 32 bit: load 32 bit‡; load 16‡, load 16; 4 × ld 8
    • store 64 bit: load 64 bit; load 32 bit‡, load 32 bit; load 16‡, 3 × load 16; 8 × ld 8
    ‡: No forwarding if the load crosses a cache line boundary
    Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding.
  • Agenda
    Introduction
    Knowledge preparation
    Notable features
    Micro-architecture tour
    Coding considerations
  • Optimizing for Instruction Fetch and PreDecode
    Avoid “Length Changing Prefixes” (LCPs)
    • Affects instructions with immediate data or offset
    • Operand Size Override (66H)
    • Address Size Override (67H) [obsolete]
    • LCPs change the length decoding algorithm, increasing the processing time from one cycle to six cycles (or eleven cycles when the instruction spans a 16-byte boundary)
    • The REX (EM64T) prefix (4xH) is not an LCP
      • The REX prefix does lengthen the instruction by one byte, so use of the first eight general registers in EM64T is preferred
  • Optimizing for Instruction Queue
    Includes a “Loop Stream Detector” (LSD)
    • Potentially very high bandwidth instruction streaming
    • A number of requirements to make use of the LSD
      • Maximum of 18 instructions in up to four 16-byte packets
      • No RET instructions (hence, little practical use for CALLs)
      • Up to four taken branches allowed
      • Most effective at 70+ iterations
    • LSD is after PreDecode, so there is no added cost for LCPs
    • Trade off LSD against conventional loop unrolling
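The LSD requirements listed on this slide read naturally as a checklist, so a small C predicate may help when reasoning about whether a hot loop is a candidate. This is only a sketch of the slide's stated limits; the `loop_shape` type and both function names are hypothetical, and the caller is assumed to obtain the counts from a disassembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a loop body, filled in from a disassembly. */
typedef struct {
    unsigned instructions;    /* instructions in the loop body          */
    unsigned fetch_packets;   /* 16-byte packets the loop body spans    */
    unsigned taken_branches;  /* taken branches per iteration           */
    bool     has_ret;         /* true if the body contains a RET        */
} loop_shape;

/* True when the loop meets the structural limits quoted on the slide. */
bool lsd_candidate(const loop_shape *l)
{
    assert(l != NULL);
    return l->instructions <= 18 &&
           l->fetch_packets <= 4 &&
           l->taken_branches <= 4 &&
           !l->has_ret;
}

/* Adds the slide's "most effective at 70+ iterations" guidance. */
bool lsd_worthwhile(const loop_shape *l, unsigned long iterations)
{
    return lsd_candidate(l) && iterations >= 70;
}
```

A loop that fails `lsd_candidate` after unrolling is the trade-off the last bullet refers to: unrolling reduces loop overhead but can push the body past the 18-instruction or four-packet limit.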
  • Optimizing for Decode
    Decoder issues up to 4 uOps for renaming/allocation per clock
    • This creates a trade-off between more complex instruction uOps and multiple simple instruction uOps
    • For example, a single four-uOp instruction is all that can be renamed/allocated in a single clock
    • In some cases, multiple simple instructions may be a better choice than a single complex instruction
    • Single-uOp instructions allow more decoder flexibility
      • For example, 4-1-1-1 can be decoded in one clock
      • However, 2-2-2-1 takes three clocks to decode
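The 4-1-1-1 versus 2-2-2-1 figures on this slide can be reproduced with a small model. The sketch below assumes (this is not stated on the slide) that one decoder accepts instructions of up to four uOps while the other three accept only single-uOp instructions, and that instructions are decoded strictly in order. `decode_clocks` is a hypothetical name, and the model ignores the separate four-uOp-per-clock rename/allocate limit.

```c
#include <assert.h>
#include <stddef.h>

/* uops[i] is the uOp count of the i-th instruction (assumed 1..4).
 * Returns the number of decode clocks under the assumed 4-1-1-1 model. */
unsigned decode_clocks(const unsigned *uops, size_t n)
{
    unsigned clocks = 0;
    size_t i = 0;

    while (i < n) {
        assert(uops[i] >= 1 && uops[i] <= 4);
        /* Slot 0: the complex decoder takes the next instruction. */
        i++;
        /* Slots 1..3: simple decoders take following single-uOp
         * instructions; a multi-uOp instruction ends this clock. */
        for (int slot = 1; slot < 4 && i < n && uops[i] == 1; slot++)
            i++;
        clocks++;
    }
    return clocks;
}
```

Under this model the sequence {4, 1, 1, 1} needs one clock and {2, 2, 2, 1} needs three, matching the slide: each two-uOp instruction has to wait for the first decoder slot.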
  • Optimizing for Execution
    Up to six uOps can be dispatched per clock
    • “Store Data” and “Store Address” dispatch ports are combined on the block diagram
    Up to four results can be written back per clock
    Single clock latency operations are best
    • Differing latency operations can create writeback conflicts
    • Separate multiple-clock uOps with several single-uOp instructions
      • Typical instructions here: ADC/SBB, RMW, CMOVcc
    • In some cases, separating an RMW instruction into its pieces might be faster (decode and scheduling flexibility)
    When equivalent, PS preferred to PD (LCP)
    • For example, MOVAPS over MOVAPD, XORPS over XORPD
  • Optimizing for Execution (cont.)
    Bypass register “access” preferred to register reads
    Partial register accesses often lead to stalls
    • Register size access that “conflicts” with a recent previous register write
    • Partial XMM updates are subject to dependency delays
    • Partial flag stalls can occur too, at much higher cost
      • Use a TEST instruction between the shift and the conditional to prevent them
    • Common zeroing instructions (e.g., XOR reg,reg) don’t stall
    Avoid bypass between execution domains
    • For example: FP (ADDPS) and logical ops (PAND) on XMMn
    Vectorization: careful packing/unpacking sequence
    • Use MXCSR’s FZ and DAZ controls as appropriate
  • Optimizing for Memory
    Software prefetch instructions
    • Can reach beyond a page boundary (including page walk)
    • Prefetches only when it completes without an exception
    General techniques to help these prefetchers
    • Organize data in consecutive lines
    • In general, increasing addresses are more easily prefetched
  • Summary
    What has been covered
    • Notable features of Core® Micro-architecture
      • Wide Dynamic Execution
      • Advanced Memory Access
      • Advanced Smart Cache
      • Advanced Digital Media Boost
      • Power Efficient Support
    • Core® Micro-architecture components
      • Front End
      • OOO execution core
      • Memory sub-system
  • Platform
    Intel provides most of the silicon on any computer
    Classical platform partition
    • CPU – Computation
    • MCH – high speed IO
    • ICH – low speed IO
    Graphics speed and memory latencies will require a different partition
    This presentation focuses on the core microarchitecture
    [Diagram labels: CPU (Core, Core, LLC, FSB, Legacy & Debug I/O); MCH (FSB, MEM/DDR, Graphics, HD video, ME, PCIe/PEG, Display, TVout, Analog, DMI); ICH (DMI, Wireless, PCI (IO), SATA, USB, KBRD, others)]
  • Intel® 64 = Extending IA-32 to 64 Bit
    Extended Memory Addressability (64-bit pointers, registers)
    + Additional Registers (8 SSE & 8 general purpose)
    + Double Precision (64-bit) Integer Support
    = With 64-Bit Extension Technology
    Added to Intel Xeon™ and Pentium® 4 processors in 2004; today available in all mainstream Intel IA-32 processors, in particular in all processors based on Intel® Core™ Architecture
  • Intel® 64 – New Modes of Operation
    Columns: Mode | OS required | Compile required | New features: 64-bit IP, RIP-relative, new registers | GPR width | Defaults: address size, operand size
    • Long mode, 64-bit mode | New 64-bit OS | Yes | Yes, Yes, Yes | 64 | 64, 32
    • Long mode, compatibility mode | New 64-bit OS | No | Yes, No, No | 32 | 32, 32 or 16, 16
    • Legacy mode (IA-32 mode) | Legacy 32-bit or 16-bit OS | No | No, No, No | 32 | 32, 32 or 16, 16
  • Registers: Extensions and Additions
    [Diagram: register file with extensions]
    • Instruction pointer: RIP (extends EIP)
    • General purpose, bits 63..32 added above bits 31..0: RAX/EAX, RBX/EBX, RCX/ECX, RDX/EDX, RBP/EBP, RSI/ESI, RDI/EDI, RSP/ESP; new registers R8 to R15
    • SSE, bits 127..0: XMM0 to XMM7; new registers XMM8 to XMM15
    • X87/MMX, bits 79..0
  • Registers: Availability in different modes
  • 64-bit Mode of Operation
    Default data size is 32 bits
    • Override to 64 bits using the new REX prefix
    All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable
    REX prefixes
    • A family of 16 prefixes, encoded 0x40-0x4F
    • Allow the use of general purpose registers as 64 bits
    • Allow the use of new registers (like r8-r15)
    Instructions that set a 32-bit register automatically zero-extend the upper 32 bits
  • REX Prefix
    A new instruction-prefix byte used in 64-bit mode
    • Specifies the new GPRs and SSE registers
    • Specifies a 64-bit operand size
    • Specifies extended control registers (used by system software)
    An instruction can only have one REX prefix and, if used, it must immediately precede the opcode or the two-byte opcode escape prefix.
    The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.
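The REX prefix family 0x40-0x4F mentioned above packs four flag bits into the low nibble (commonly written 0100WRXB): W selects a 64-bit operand size, and R, X and B each extend a register field so that the new registers can be addressed. A minimal decoding sketch in C follows; the `rex_fields` type and `decode_rex` function are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded flag bits of a REX prefix byte (hypothetical type). */
typedef struct {
    bool w;  /* bit 3: 64-bit operand size                         */
    bool r;  /* bit 2: extends the ModRM reg field                 */
    bool x;  /* bit 1: extends the SIB index field                 */
    bool b;  /* bit 0: extends ModRM r/m, SIB base or opcode reg   */
} rex_fields;

/* Returns true and fills *out when byte is in 0x40..0x4F, i.e. a REX
 * prefix in 64-bit mode; returns false for any other byte. */
bool decode_rex(uint8_t byte, rex_fields *out)
{
    assert(out != NULL);
    if ((byte & 0xF0u) != 0x40u)
        return false;
    out->w = (byte & 0x08u) != 0;
    out->r = (byte & 0x04u) != 0;
    out->x = (byte & 0x02u) != 0;
    out->b = (byte & 0x01u) != 0;
    return true;
}
```

For example, 0x48 sets only W (a 64-bit operation on the legacy registers), while 0x41 sets only B (a 32-bit operation that reaches one of r8-r15 through the r/m or base field).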
  • Physical and Linear Addressing
    Linear addressing
    • Initial Intel® 64 implementations support 48 bits of virtual addressing.
    • Addresses are required to be in canonical form: bits 47 through 63 must all be 1 or all be 0.
    Physical addressing
    • Initial NetBurst™ Intel® 64 implementations support 36 bits; all current processors support at least 40 bits.
    • Entries in page tables are expanded for up to 52 bits of physical address.
  • Intel® 64 – Large Memory Considerations
    Canonical addressing for 64-bit addresses
    • Although the architecture now allows calculating flat addresses to 64 bits, today’s processors limit virtual addressing to 48 bits
    • Canonical address definition: an address that has address bits 63 through 47 set to either all ones or all zeros
    • Canonical addresses are a requirement
      • Values for addresses that are not canonical will cause faults when put into locations expecting a valid address, such as segment registers
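The canonical-address definition above (bits 63 through 47 all ones or all zeros) translates directly into a bit test. The sketch below assumes the 48-bit virtual address width quoted on the slide; `is_canonical48` and `make_canonical48` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when bits 63..47 of addr are all zeros or all ones. */
bool is_canonical48(uint64_t addr)
{
    uint64_t upper = addr >> 47;           /* the 17 bits 63..47 */
    return upper == 0 || upper == 0x1FFFFu;
}

/* Builds a canonical address from the low 48 bits of addr by copying
 * bit 47 into bits 63..48 (sign extension done with masks, so no
 * implementation-defined signed shift is involved). */
uint64_t make_canonical48(uint64_t addr)
{
    uint64_t result;
    if (addr & (UINT64_C(1) << 47))
        result = addr | UINT64_C(0xFFFF000000000000);
    else
        result = addr & UINT64_C(0x0000FFFFFFFFFFFF);
    assert(is_canonical48(result));
    return result;
}
```

The two canonical ranges are therefore 0x0000000000000000 to 0x00007FFFFFFFFFFF and 0xFFFF800000000000 to 0xFFFFFFFFFFFFFFFF; everything in between is non-canonical.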
  • Introducing SIMD: Single Instruction Multiple Data
    Scalar processing
    • Traditional mode
    • One operation produces one result
    • [Diagram: X + Y = X+Y]
    SIMD processing
    • With SSE / SSE2
    • One operation produces multiple results
    • [Diagram: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)]
  • X86 Register Sets
    SSE registers introduced first in Pentium® 3
    IA-INT registers (32-bit, eax … edi)
    • Fourteen 32-bit registers
    • Scalar data & addresses
    • Direct access to regs
    MMX™ Technology / IA-FP registers (80/64-bit, st0/mm0 … st7/mm7)
    • Eight 80/64-bit registers
    • Hold data only
    • Stack access to FP0..FP7
    • Direct access to MM0..MM7
    • No MMX™ Technology / FP interoperability
    SSE registers (128-bit, xmm0 … xmm7)
    • Eight 128-bit registers
    • Hold data only: 4 x single FP numbers, 2 x double FP numbers, 128-bit packed integers
    • Direct access to the registers
    • Use simultaneously with FP / MMX Technology
  • Instruction Set Extensions
    [Bar chart: New Instructions Added to Intel® Processors. Vertical axis 0 to 160; bar values as extracted: 144, 70, 56, ~50, 32, 32, 13]
    • Jan-97: MMX™ (350 nm)
    • Feb-99: Streaming SIMD Extensions (SSE) (250 nm)
    • Dec-00: Streaming SIMD Extensions 2 (SSE2) (180 nm)
    • Feb-04: Streaming SIMD Extensions 3 (SSE3) (90 nm)
    • Jul-06: Supplemental SSE3 (SSSE3) (65 nm)
    • 2008+: SSE-4 (45 nm)
    • Future: future Intel instruction set extensions (45 nm)
    45 nm, beginning in 2008: ~50 new instructions in 13 groups
    • All function in 32-bit and 64-bit modes
    • Improvements in commercial data integrity, iSCSI, video processing, string and text processing, 2D & 3D imaging, vectorizing compiler performance
  • SSE and SSE-2 Data Types
    SSE
    • 4x floats
    SSE-2
    • 2x doubles
    • 16x bytes
    • 8x 16-bit shorts
    • 4x 32-bit integers
    • 2x 64-bit integers
    • 1x 128-bit(!) integer
  • SSE Instruction Set Extensions
    Introduced by Pentium® 3 in 1999; now frequently called SSE-1
    Only new data type supported: 4 x 32-bit (single precision) floating point data
    Some 70 instructions
    • Arithmetic, compare, convert operations on SSE SP FP data
      • PACKED, UNPACKED
    • Data load/store
    • Prefetch
    • Extension of MMX
    • Streaming store (store without using cache in between)
    • …
  • SSE Sample: Branch Removal
    R = (A < B) ? C : D   // remember: everything packed
    • A = (0.0, 0.0, -3.0, 3.0), B = (0.0, 1.0, -5.0, 5.0)
    • cmplt: mask = (00000, 11111, 00000, 11111)
    • and: mask and (c3, c2, c1, c0) = (00000, c2, 00000, c0)
    • nand: (not mask) and (d3, d2, d1, d0) = (d3, 00000, d1, 00000)
    • or: R = (d3, c2, d1, c0)
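The cmplt / and / nand / or sequence on this slide can be followed step by step in portable C. The sketch below emulates the four packed lanes with a loop over 32-bit bit patterns so that it runs on any platform; it is an illustration of the masking idea, not SSE code, and `select_lt` is a hypothetical name. On an SSE target the corresponding intrinsics would be `_mm_cmplt_ps`, `_mm_and_ps`, `_mm_andnot_ps` and `_mm_or_ps` from `<xmmintrin.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back, without aliasing issues. */
static uint32_t bits_of(float f)   { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float float_of(uint32_t u)  { float f; memcpy(&f, &u, sizeof f); return f; }

/* Lane-wise r[i] = (a[i] < b[i]) ? c[i] : d[i], built from a mask
 * instead of selecting between the values with a branch. Index 0 is
 * the lowest lane (x0 on the slide). */
void select_lt(const float a[4], const float b[4],
               const float c[4], const float d[4], float r[4])
{
    assert(a != NULL && b != NULL && c != NULL && d != NULL && r != NULL);
    for (int i = 0; i < 4; i++) {
        /* cmplt: all ones where a < b, all zeros elsewhere */
        uint32_t mask = (a[i] < b[i]) ? 0xFFFFFFFFu : 0u;
        uint32_t from_c = mask & bits_of(c[i]);     /* and  */
        uint32_t from_d = ~mask & bits_of(d[i]);    /* nand (and-not) */
        r[i] = float_of(from_c | from_d);           /* or   */
    }
}
```

With the slide's inputs, lanes 0 and 2 take their values from C and lanes 1 and 3 from D, giving (d3, c2, d1, c0) when written from the highest lane to the lowest.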
  • SSE-2 Instruction Set Extensions
    • Introduced by Intel® Pentium® 4 processor in 2000
    • Some 140 new instructions
    • Added double precision floating point data (2x64-bit) and all related instructions, including conversion
    • Again some extensions to MMX
    • Added all possible combinations of integer data to SSE (1x128, 2x64, 4x32, 8x16, 16x8) and related operations
  • SIMD Single vs. SIMD Double
    4 x single precision (SSE-1): SIMD SP FP operand = 4 elements
    • Operand, bits 127..0: X3 X2 X1 X0
    • Element = SP FP number: bit 31 sign (S), bits 30..23 exponent, bits 22..0 significand
    2 x double precision (SSE-2): SIMD DP FP operand = 2 elements
    • Operand, bits 127..0: X1 X0
    • Element = DP FP number: bit 63 sign (S), bits 62..52 exponent, bits 51..0 significand
  • Sample for SSE-2: SIMD Double ↔ SIMD Int Conversion
    SIMD Double → SIMD Int: conversion to the two lower ints, two higher ints cleared
        __m128d x; __m128i ix;
        ix = _mm_cvtpd_epi32(x);
        x:   x1  x0
        ix:  00000  00000  (int)x1  (int)x0
    SIMD Int → SIMD Double: conversion from the two lower ints
        x = _mm_cvtepi32_pd(ix);
        ix:  ????  ????  ix1  ix0
        x:   (double)ix1  (double)ix0
  • SSE3: No New Data Types but New Instructions
    • FP to integer conversions: FISTTP
    • Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
    • Video encoding: LDDQU
    • SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS
    • Thread synchronization: MONITOR, MWAIT
    * Also benefits Complex and Vectorization
  • Streaming SIMD Extensions 3
    13 new instructions
    Three have limited use for application performance improvement
    • FISTTP: x87 to integer conversion (requires –longdouble switch)
    • MONITOR/MWAIT: thread synchronization
      • Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages
    The other ten have some potential for specific application domains
  • SSE-3 Sample, Complex Arithmetic: ADDSUBPS
    ADDSUBPS OperandA, OperandB
    • OperandA (xmm register; 4 data elements): a3, a2, a1, a0
    • OperandB (xmm register or memory address; 4 data elements): b3, b2, b1, b0
    • Result (stored in OperandA): a3+b3, a2-b2, a1+b1, a0-b0
    Intrinsic: __m128 _mm_addsub_ps(__m128 a, __m128 b)
        a:       a3      a2      a1      a0
        b:       b3      b2      b1      b0
        op:      Add     Sub     Add     Sub
        result:  a3+b3   a2-b2   a1+b1   a0-b0
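To show why an alternating subtract/add is useful for complex arithmetic, here is a scalar C model of ADDSUBPS together with a complex multiply built on top of it. This is a sketch, not intrinsic code: `Vec4`, `addsub_ps_model` and `complex_mul2` are names invented for the example, and the re0, im0, re1, im1 lane layout is an assumption.

```c
typedef struct { float lane[4]; } Vec4;   /* lane[0] corresponds to a0 on the slide */

/* Scalar model of ADDSUBPS: subtract in even lanes, add in odd lanes. */
Vec4 addsub_ps_model(Vec4 a, Vec4 b) {
    Vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (i & 1) ? a.lane[i] + b.lane[i] : a.lane[i] - b.lane[i];
    return r;
}

/* Two complex products at once; assumed layout is re0, im0, re1, im1.
   (re + im*i) * (c + d*i) = (re*c - im*d) + (im*c + re*d)*i */
Vec4 complex_mul2(Vec4 x, Vec4 y) {
    Vec4 t1, t2;
    for (int k = 0; k < 4; k += 2) {
        float re = x.lane[k], im = x.lane[k + 1];
        float c  = y.lane[k], d  = y.lane[k + 1];
        t1.lane[k] = re * c;  t1.lane[k + 1] = im * c;   /* x times (c, c) */
        t2.lane[k] = im * d;  t2.lane[k + 1] = re * d;   /* swapped x times (d, d) */
    }
    return addsub_ps_model(t1, t2);   /* even lane: re*c - im*d, odd lane: im*c + re*d */
}
```

In real SSE3 code the two temporaries would come from packed multiplies on duplicated and swapped inputs; the model only shows how the final subtract/add pattern lines up with the real and imaginary parts.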
  • Sample SSSE-3 Instruction: Byte Permute
    PSHUFB mm, mm/m64
    PSHUFB xmm, xmm/m128
    • A complete byte-granularity permutation
    • The source operand is used as the control field (variable control)
    • The destination operand gets permuted
    • Each byte of the source field selects the origin of the corresponding destination byte
    • Also includes a force-byte-to-zero flag (bit 7)
        src (control):   0x07  0x07  0xFF  0x80  0x01  0x00  0x00  0x00
        dest (before):   0x04  0x01  0x07  0x03  0x02  0x02  0xFF  0x01
        dest (after):    0x04  0x04  0x00  0x00  0xFF  0x01  0x01  0x01
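The permutation rule can be checked against the slide's numbers with a small scalar model of the 64-bit form. The function names are hypothetical, and byte index 0 is assumed to be the rightmost byte on the slide; the model uses the low three bits of each control byte as the index and bit 7 as the zero flag.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the 64-bit PSHUFB form: dest is permuted in place
   under control of ctrl. Index 0 is the rightmost byte on the slide. */
void pshufb64_model(uint8_t dest[8], const uint8_t ctrl[8]) {
    uint8_t old[8];
    memcpy(old, dest, sizeof old);
    for (int i = 0; i < 8; i++)
        dest[i] = (ctrl[i] & 0x80) ? 0 : old[ctrl[i] & 0x07]; /* bit 7 forces zero */
}

/* Replays the slide's example; returns 1 if the model reproduces it. */
int pshufb_slide_example_ok(void) {
    const uint8_t ctrl[8]   = {0x00, 0x00, 0x00, 0x01, 0x80, 0xFF, 0x07, 0x07};
    uint8_t dest[8]         = {0x01, 0xFF, 0x02, 0x02, 0x03, 0x07, 0x01, 0x04};
    const uint8_t expect[8] = {0x01, 0x01, 0x01, 0xFF, 0x00, 0x00, 0x04, 0x04};
    pshufb64_model(dest, ctrl);
    return memcmp(dest, expect, sizeof expect) == 0;
}
```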
  • Ways to SSE/SIMD Programming
    Coding using SSE/SSE2/3/4 assembler instructions
    • Very tedious (manual scheduling) and discouraged: don't do it!
    • E.g.: how do you exploit the benefit of now having 16 instead of 8 SSE registers on Intel® 64 without maintaining two versions?
    Intel® compiler's C/C++ SIMD intrinsics
    • No need to take care of register allocation, scheduling, etc.
    Intel® compiler's C++ Vector Class Library
    • Use this if you are heavily into C++ classes
    Vectorizer of Intel® C++ and Fortran Compilers
    • Recommended for most cases: easy and efficient
    Use ready-to-go vectorized code from a library like Intel® Math Kernel Library (MKL)
  • Compiler Based Vectorization: Processor Specific
    Linux* switch   Generate code and optimize for
    -xK, -axK       Pentium® 3 compatible and Athlon XP processors, including code generation for MMX and SSE
    -xW, -axW       Pentium® 4 compatible, Athlon 64, Opteron processors in 32 and 64 bit mode, including code generation for MMX, SSE and SSE2
    -xN, -axN       Pentium® 4 processors in 32-bit mode, including code generation for MMX, SSE and SSE2 (deprecated switch: use xW instead)
    -xB, -axB       Pentium® M processors, including code generation for MMX, SSE and SSE-2
    -xP, -axP       Intel® processors with SSE3 capability, including Pentium 4 (both 32 and 64 bit mode), including code generation for MMX, SSE, SSE2 and SSE-3
    -xT, -axT       Intel® processors with MNI capability, Intel® Core™2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE, SSE2, SSE-3 and MNI
  • Intel® Core® Micro-architecture Notable Features (cont.): New Instructions
    Instruction name                                       Description
    psignb/w/d mm, mm/m64; psignb/w/d xmm, xmm/m128        Per element, if the source operand is negative, multiply the destination operand by -1.
    pabsb/w/d mm, mm/m64; pabsb/w/d xmm, xmm/m128          Per element, overwrite destination with absolute value of source.
    phaddw/d/sw mm, mm/m64; phaddw/d/sw xmm, xmm/m128      Pairwise integer horizontal addition + pack.
    phsubw/d/sw mm, mm/m64; phsubw/d/sw xmm, xmm/m128      Pairwise integer horizontal subtract + pack.
    PMADDUBSW mm, mm/m64; PMADDUBSW xmm, xmm/m128          Multiply signed & unsigned bytes. Accumulate result to signed words. (Multiply Accumulate)
    PMULHRSW mm, mm/m64; PMULHRSW xmm, xmm/m128            Signed 16-bit multiply, return high bits.
    PSHUFB mm, mm/m64; PSHUFB xmm, xmm/m128                A complete byte-granularity permutation, including force-to-zero flag.
    PALIGNR mm, mm/m64, imm8; PALIGNR xmm, xmm/m128, imm8  Extract any contiguous 16 (8 in the 64-bit case) bytes from the pair [dst, src] and store them to the dst register.
  • Dependencies and Bypasses
    "Read-after-Write" dependency: 1 clock stall, assuming the register file can be written through
        add eax, ecx -> eax      F D E W
        sub ebx, eax -> ebx        F D D E W
    "E to D" bypass: saves the clock penalty
        add eax, ecx -> eax      F D E W
        sub ebx, eax -> ebx        F D E W
    Long-latency operations
        load [ecx+edi] -> eax    F D E E E W
        add ebx, eax -> ebx        F D D D E W
  • Fighting Stalls: Branch Handling
    Given the code:
        for (i=100, a=0; i>0; i--) a+=B[i];
    the compiler would generate:
        // eax initialized with zero, edi initialized with 100
        loop: load B[edi] -> ebx      // read B[i] from memory
              add eax, ebx -> eax     // a+=B[i]
              add edi, -1 -> edi      // i-=1
              jnz edi, loop
              store eax -> a          // store result
  • Fighting Stalls: Branch Handling (cont.)
        load B[edi] -> ebx     F D E W
        add eax, ebx -> eax      F D E W
        add edi, -1 -> edi         F D E W
        jnz edi, loop                F D E W
        store eax -> a                 F D E W
        xxx                              F D E W
        load B[edi] -> ebx                 F D E W
    Only after the branch's Execute stage do we know that the next fetch was wrong
    • Need to flush the pipe
    • IPC: 4 instructions in 6 clocks (IPC = 0.66 vs. optimum IPC = 1)
    • "Pipe break" penalty = 2 clocks
    • Adding a stage? IPC = 0.57, ~14% slower!
    Prolonging the pipeline achieves higher frequencies; however, the pipe break penalty increases!
    MUST solve the pipe break penalty problem!
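The IPC figures on this slide follow from simple arithmetic: each loop iteration costs its instruction count plus the pipe break penalty in clocks. A minimal sketch, with function names invented for the illustration and the assumption that every iteration ends in a pipe break:

```c
/* IPC of one loop iteration when every iteration ends in a pipe break. */
double loop_ipc(int instructions, int flush_penalty_clocks) {
    return (double)instructions / (double)(instructions + flush_penalty_clocks);
}

/* Relative slowdown, in percent, when the penalty grows from p_old to p_new. */
double slowdown_percent(int instructions, int p_old, int p_new) {
    double before = loop_ipc(instructions, p_old);
    double after  = loop_ipc(instructions, p_new);
    return 100.0 * (1.0 - after / before);
}
```

For the slide's loop, `loop_ipc(4, 2)` is 4/6 and `loop_ipc(4, 3)` is 4/7, so one extra pipeline stage ahead of Execute costs about 14% at the same clock frequency.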
  • Fighting Stalls: Branch Handling (cont.)
    H/W can "learn" about SW behavior
    • Same branch goes the same direction in most cases
    • Learn branch address and target
      • Branch Target Buffer (BTB)
    • Predict based on branch history, surrounding branch behavior, loop behavior
      • We are at ~95% correct prediction
    • Looks in BTB while fetching the instruction
    • Lee & Smith or Yeh & Patt algorithms
    New (and correct) pointer calculated in the Fetch stage of the branch
        load B[edi] -> ebx     F D E W
        add eax, ebx -> eax      F D E W
        add edi, -1 -> edi         F D E W
        jnz edi, loop                F/P D E W
        load B[edi] -> ebx             F D E W
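As an illustration of hardware "learning" a branch, the sketch below runs a 2-bit saturating counter over the loop branch from the earlier slide. This is a classic textbook predictor chosen for the example, not a description of what the Core predictor actually implements; the function name and the state encoding (0..1 predict not taken, 2..3 predict taken) are assumptions.

```c
/* Counts correct predictions of a 2-bit saturating counter for the
   loop-closing jnz of a loop that runs `iterations` times.
   initial_state: 0..3, where 0..1 predict not taken and 2..3 predict taken. */
int count_correct_predictions(int iterations, int initial_state) {
    int state = initial_state;
    int correct = 0;
    for (int i = iterations; i > 0; i--) {
        int taken = (i - 1) != 0;          /* jnz: loop back while the counter is non-zero */
        int predicted_taken = state >= 2;
        if (predicted_taken == taken) correct++;
        if (taken && state < 3) state++;   /* strengthen towards "taken" */
        if (!taken && state > 0) state--;  /* strengthen towards "not taken" */
    }
    return correct;
}
```

For the 100-iteration loop this model mispredicts only the final loop exit when it starts in a "taken" state, and additionally the first two iterations when it starts at 0.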
  • Advanced Pipeline Techniques
    Limitations of the typical pipeline scheme
    • IPC is theoretically limited to 1
      • Actual IPC is less than 1 because of long-latency operations, stalls (e.g. cache miss), pipeline flushes (due to branch misprediction), etc.
    • Pipeline stages are frequently not balanced
      • Cycle time (Tc) is determined by the longest pipeline stage
    Advanced pipeline techniques
    • Super pipeline
    • Super-scalar
  • Advanced Pipeline Techniques (cont.)
    Super pipeline: shorter stages allow higher frequency
    [Figure: three instructions, each passing through stages F1 F2 D1 D2 E1 E2 W1 W2]
    Super-scalar: perform more in a single cycle
    [Figure: four instructions, each passing through stages F D E W]
  • Fighting Stalls: Out-of-Order Execution (OoO)
    Instructions are executed based on "data flow" rather than program order (Tomasulo's algorithm)
    1. Instruction fetch and decode.
    2. Instruction is queued at the Reservation Station.
    3. Instruction
       • waits in the queue until all input operands are available
       • can leave the queue before earlier, older instructions
       (This avoids the stall that occurs at this stage in an in-order processor.)
    4. Instruction execution.
    5. Results are queued.
    6. Instruction reorder and writeback.
  • Fighting Stalls: Register Renaming
    Creates new opportunities for OOO execution
    • Eliminates Write-after-Write (WAW) and Write-after-Read (WAR) dependencies (hazards)
    Architectural vs. physical registers
    Example 1:
        MULTD F4,F2,F2    reads from F2
        ADDD  F2,F0,F6    writes to F2
      after renaming:
        MULTD F4,F2,F2
        ADDD  F8,F0,F6    (assume F8 is unused)
    Example 2:
        1. mov eax, [m1]
        2. add eax, 2
        3. mov [m2], eax
        4. mov eax, [m3]
        5. add eax, 4
        6. mov [m4], eax
      4, 5, 6 can be executed in parallel with 1, 2, 3, but only after register renaming!
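The six-instruction example can be traced with a toy renamer: every write to an architectural register gets a fresh physical register, and every read uses the most recent mapping. Everything here (the `Op` struct, the register count, the function names) is hypothetical and far simpler than a real register alias table; it only shows why instructions 4 to 6 stop conflicting with 1 to 3.

```c
#define NUM_ARCH_REGS 8      /* hypothetical architectural register count */
#define NO_REG (-1)

typedef struct { int dst, src; } Op;   /* register numbers, or NO_REG */

/* Rename ops in program order: sources read the current mapping,
   each destination is given a fresh physical register. */
void rename_ops(const Op *in, Op *out, int n) {
    int map[NUM_ARCH_REGS];
    int next_phys = NUM_ARCH_REGS;
    for (int r = 0; r < NUM_ARCH_REGS; r++) map[r] = r;
    for (int i = 0; i < n; i++) {
        out[i].src = (in[i].src == NO_REG) ? NO_REG : map[in[i].src];
        out[i].dst = NO_REG;
        if (in[i].dst != NO_REG) {
            map[in[i].dst] = next_phys++;
            out[i].dst = map[in[i].dst];
        }
    }
}

/* The slide's example with eax as register 0. Returns 1 if, after
   renaming, instructions 4-6 share no physical register with 1-3. */
int slide_example_independent(void) {
    enum { EAX = 0 };
    const Op in[6] = {
        {EAX, NO_REG},   /* 1. mov eax, [m1] */
        {EAX, EAX},      /* 2. add eax, 2    */
        {NO_REG, EAX},   /* 3. mov [m2], eax */
        {EAX, NO_REG},   /* 4. mov eax, [m3] */
        {EAX, EAX},      /* 5. add eax, 4    */
        {NO_REG, EAX},   /* 6. mov [m4], eax */
    };
    Op out[6];
    rename_ops(in, out, 6);
    for (int i = 0; i < 3; i++)
        for (int j = 3; j < 6; j++) {
            int regs_i[2] = {out[i].dst, out[i].src};
            int regs_j[2] = {out[j].dst, out[j].src};
            for (int a = 0; a < 2; a++)
                for (int b = 0; b < 2; b++)
                    if (regs_i[a] != NO_REG && regs_i[a] == regs_j[b]) return 0;
        }
    return 1;
}
```

Before renaming all six instructions name eax; after renaming the two chains use disjoint physical registers, so only the true read-after-write dependencies inside each chain remain.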
  • Fighting Stalls: Re-Order Buffer (ROB)
    Mechanism for renaming and retirement
    Table contains instructions in program order
    • Instructions are entered in order
    • Registers are renamed by the entry number
    • Once assigned: execution order is unimportant
    • After execution: entries are marked
    • An executed entry can be "retired" once all prior instructions have retired. That is:
      • Update the "real registers" with the values of the renamed registers
      • Update memory
      • Leave the ROB
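The retirement rule (an executed entry retires only once all prior instructions have retired) amounts to a running maximum over completion times. The sketch below assumes, for simplicity, that any number of entries may retire in the same cycle; the function name is made up for the example.

```c
/* Cycle in which instruction idx retires, given the cycle in which each
   instruction (listed in program order) finishes executing.
   Simplifying assumption: unlimited retirement bandwidth per cycle. */
int retire_cycle_of(const int *done_cycle, int n, int idx) {
    int retire = 0;
    for (int i = 0; i < n && i <= idx; i++)
        if (done_cycle[i] > retire)
            retire = done_cycle[i];   /* an entry waits for every older entry */
    return retire;
}
```

For completion cycles 5, 2, 3, 9, 4, the second and third instructions finish early but retire in cycle 5, behind the first; the last one finishes in cycle 4 but retires in cycle 9.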
  • Fighting Stalls: Reservation Station(s)
    • Pool(s) of all "not yet executed" instructions
    • Maintains operand status: "ready" / "not ready"
    • Each cycle, executed instructions make more operands "ready"
    • Instructions whose operands are all "ready" can be "dispatched" for execution
    • Dispatcher chooses which of the "ready" instructions will be executed next
  • Fighting Stalls: Memory Order Buffer (MOB)
    Idea: allow out-of-order execution among memory operations
    Problem: memory dependencies cannot be fully resolved statically (memory disambiguation)
    • Structure similar in concept to the ROB
    • Every access is allocated an entry
    • Address & data (for stores) are updated when known
    • Load is checked against all previous stores
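"Load is checked against all previous stores" can be sketched as a scan over the older store entries, youngest first. The policy below is deliberately conservative (a load waits whenever it meets an older store whose address is still unknown); it is an assumption made for illustration, and the types and names are invented. Real hardware may instead predict that the load is independent.

```c
typedef struct {
    int addr_known;      /* 1 once the store address has been computed */
    unsigned addr;       /* valid only if addr_known is 1 */
} StoreEntry;

enum { LOAD_MUST_WAIT, LOAD_FORWARD_FROM_STORE, LOAD_FROM_CACHE };

/* older[0..n-1] are the stores that precede the load in program order,
   oldest first. The scan runs from the youngest older store backwards. */
int check_load(const StoreEntry *older, int n, unsigned load_addr) {
    for (int i = n - 1; i >= 0; i--) {
        if (!older[i].addr_known)
            return LOAD_MUST_WAIT;            /* might alias: be conservative */
        if (older[i].addr == load_addr)
            return LOAD_FORWARD_FROM_STORE;   /* youngest matching store wins */
    }
    return LOAD_FROM_CACHE;                   /* no older store to this address */
}
```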
  • Intel® Core® Micro-architecture Notable Features (cont.)
    Intelligent Power Capability: Split Busses (core power feature)
    • Many buses are sized for worst-case data (x86 instruction of 15 bytes; ALU can write back 128 bits)
    Improved Energy Efficiency
  • Intel® Core® Micro-architecture Notable Features (cont.)
    Intelligent Power Capability: Split Busses (core power feature)
    • By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while keeping dynamic capacitance (C dynamic) closer to that of thinner buses
    Improved Energy Efficiency
  • Agenda
    Introduction
    Knowledge refreshment
    Notable features
    Micro-architecture drill-down
    • Front End
    • Out-Of-Order Execution Core
    • Memory Sub-system
    Coding considerations
  • Intel® Core® Micro-architecture Overview
    [Figure: block diagram. System Bus; Bus Unit; 2nd Level Cache; 1st Level Cache (Data). Front End: Instruction Fetch Unit, Decode/IQ, Branch Prediction Unit. Execution Core: Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement).]
  • Intel® Core® Micro-architecture Drill-down
    [Figure: block diagram. icache, branch prediction unit, predecode unit, instruction queue, instruction decode, MS, register alias table, ALLOC, Reservation Station, Re-Order Buffer, integer / FP / SIMD units (3x), load, store address, store data, memory order buffer, data cache, page miss handler.]
  • Example Code to Be Used
        …
        addps xmm0, [EAX+16]
        mulps xmm0, xmm0
        movps [EAX+240], xmm0
        cmp EAX, 100000
        jge label
        …
  • Core® Micro-architecture Front End
    Prepares instructions before they are executed
    • Instruction Fetch Unit
    • Instruction Queue
    • Instruction Decode Unit
    • Branch Prediction Unit
  • Intel® Core™ Microarchitecture – Front End: Instruction Fetch Unit
    • Prefetches instructions that are likely to be executed
    • Caches frequently used instructions
    • Predecodes and buffers instructions
    [Figure: front-end pipeline (icache, branch prediction unit, predecode unit, instruction queue, instruction decode, MS) alongside the overview block diagram]
  • Intel® Core™ Microarchitecture – Front End: Instruction Fetch Unit (cont.)
    I-Cache (Instruction Cache)
    • 32 KBytes / 8-way / 64-byte line
    • 16 aligned bytes fetched per cycle
    ITLB (Instruction Translation Lookaside Buffer)
    • 128 4 KB pages, 8 2 MB pages
    Instruction Prefetcher
    • 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers
    Instruction Pre-decoder
    • Instruction Length Decode (predecode)
    • Avoid Length Changing Prefixes (LCP); for example, avoid in a loop: MOV dx, 1234h
    • The REX (EM64T) prefix (4xH) is not an LCP
    Instruction format: Prefixes (66H/67H) | Opcode | ModR/M | SIB | Displacement | Immediate
  • Intel® Core™ Microarchitecture – Front End: Instruction Queue
    Buffer between the instruction pre-decode unit and the decoder
    • Up to six predecoded instructions written per cycle
    • 18 instructions contained in the IQ
    • Up to 5 instructions read from the IQ
    Potential loop cache: Loop Stream Detector (LSD) support
    • Re-use of decoded instructions
    • Potential power saving
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode
    • Decodes the instructions into micro-ops
    • Ready for execution in the OOO core
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode
    Decoders
    Features
    • Macro-fusion
    • Micro-fusion
    • Stack Pointer Tracking
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode / Decoders
    Instructions are converted to micro-ops (uops)
    • 1-uop instructions include load+op, stores, indirect jump, RET...
    4 decoders: 1 "large" and 3 "small"
    • All decoders handle "simple" 1-uop instructions
    • The one large decoder handles instructions of up to 4 uops
    All decoders work in parallel
    • Four(+) instructions / cycle
    Micro-Sequencer takes over for long flows (the large decoder handles instructions of 2 to 4 uops; the microcode ROM handles more complex ones)
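The "1 large + 3 small" arrangement implies that instruction order affects decode throughput. The sketch below is a simplified model under stated assumptions: each cycle the large decoder takes the next instruction (1 to 4 uops), and the three small decoders take the following instructions only while they are 1-uop; microcoded flows, macro-fusion and fetch limits are ignored. The function name and constants are invented for the example.

```c
#include <assert.h>

#define NUM_SMALL_DECODERS 3
#define LARGE_DECODER_MAX_UOPS 4

/* Number of decode cycles for a sequence of instructions, where uops[i]
   is the micro-op count of instruction i (each must be 1..4 in this model). */
int decode_cycles(const int *uops, int n) {
    int cycles = 0;
    int i = 0;
    while (i < n) {
        assert(uops[i] >= 1 && uops[i] <= LARGE_DECODER_MAX_UOPS);
        i++;   /* the large decoder takes the next instruction */
        for (int d = 0; d < NUM_SMALL_DECODERS && i < n && uops[i] == 1; d++)
            i++;   /* small decoders take following 1-uop instructions */
        cycles++;
    }
    return cycles;
}
```

In this model a multi-uop instruction that lands in a small-decoder slot ends the current decode group, so a 2-uop instruction placed first decodes together with three simple ones, while the same instruction placed second costs an extra cycle.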
  • Intel® Core™ Microarchitecture – Front End: Code Sequence in Front End
    Instruction Queue (IQ):
        addps xmm0, [EAX+16]
        mulps xmm0, xmm0
        movps [EAX+240], xmm0
        cmp EAX, 100000
        jne label
    • These instructions took more than one fetch, as they are 22 bytes
    • The IQ buffers them together
    Decoders: large (dec0), small (dec1), small (dec2), small (dec3)
    • All instructions are decodable by all decoders
    • CMP and the adjacent JCC are "fused" into a single uop
    • Up to 5 instructions decoded per cycle
    Resulting uops:
        load_add xmm0, xmm0, [EAX+16]
        mulps xmm0, xmm0, xmm0
        sta_std [EAX+240], xmm0
        cmpjne EAX, 100000, label
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode / Macro-Fusion
    • Roughly 15% of all instructions are conditional branches.
    • Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
    • Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
    • Not supported in EM64T long mode
    [Figure: Scheduler -> "cmpjae eax, [mem], label" -> Execution -> Branch Eval -> flags and target to write back]
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode / Macro-Fusion Absent
    • Read four instructions from the Instruction Queue
    • Each instruction gets decoded into separate uops
    Example:
        for (int i=0; i<100000; i++) { … }
    Instruction Queue:
        addps xmm0, [EAX+16]
        mulps xmm0, xmm0
        movps [EAX+240], xmm0
        cmp eax, 100000
        jge label
    Cycle 1:
        addps xmm0, [EAX+16]      dec0
        mulps xmm0, xmm0          dec1
        movps [EAX+240], xmm0     dec2
        cmp eax, 100000           dec3
    Cycle 2:
        jge label                 dec0
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode / Macro-Fusion Present
    • Read five instructions from the Instruction Queue
    • Send the fusable pair to a single decoder
    • A single uop represents two instructions
    Enabling example:
        for (unsigned int i=0; i<100000; i++) { … }
    Instruction Queue:
        addps xmm0, [EAX+16]
        mulps xmm0, xmm0
        movps [EAX+240], xmm0
        cmp eax, 100000
        jae label
    Cycle 1:
        addps xmm0, [EAX+16]          dec0
        mulps xmm0, xmm0              dec1
        movps [EAX+240], xmm0         dec2
        cmpjae eax, 100000, label     dec3
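The two "enabling example" loops differ only in the signedness of the counter, which decides whether the loop test is a signed compare (jge family) or an unsigned one (jae family). A minimal C pair showing that difference is given below. The function names are made up, and whether a given compiler actually emits an adjacent cmp/jcc pair for these loops is an assumption; optimizers are free to restructure or vectorize them.

```c
/* Signed counter: the loop test is a signed comparison (jge/jl family),
   the case shown on the "fusion absent" slide. */
float sum_signed(const float *v, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Unsigned counter: the loop test is an unsigned comparison (jae/jb family),
   the case shown on the "fusion present" slide. */
float sum_unsigned(const float *v, unsigned int n) {
    float s = 0.0f;
    for (unsigned int i = 0; i < n; i++)
        s += v[i];
    return s;
}
```

Both functions compute the same result; only the generated compare-and-branch differs, which is why the slides present the unsigned form as the one that lets the decoder fuse the pair.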
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode / Macro-Fusion (cont.)
    Benefits
    • Reduced latency
    • Increased renaming
    • Increased retire bandwidth
    • Increased virtual storage
    • Power savings
    Enabling greater performance and efficiency
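The decode-bandwidth effect of the two previous slides can be reproduced with a toy model. This is a simplified sketch, not a description of the real decoders: it assumes four decoder slots per cycle (dec0 to dec3, as drawn), that any decoder accepts any instruction, and that a fused pair occupies one slot. The function name is hypothetical.

```c
#include <assert.h>

#define DECODERS 4u  /* dec0..dec3, as drawn on the slides */

/* Number of decode cycles needed for `instructions` instructions when
   `fused_pairs` pairs (two instructions each) share a single decoder slot.
   Simplified model: every slot is usable by any instruction every cycle. */
unsigned int decode_cycles(unsigned int instructions, unsigned int fused_pairs)
{
    assert(2u * fused_pairs <= instructions);
    unsigned int slots = instructions - fused_pairs;
    return (slots + DECODERS - 1u) / DECODERS;  /* ceiling division */
}
```

For the five-instruction loop body, decode_cycles(5, 0) gives 2 cycles without fusion and decode_cycles(5, 1) gives 1 cycle with the cmp/jae pair fused, matching the slides.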
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode
    Decoder features
    • Macro-fusion
    • Micro-fusion
    • Stack Pointer Tracking
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode / Micro-Op Fusion
    Frequent pairs of micro-operations derived from the same macro-instruction can be fused into a single micro-operation.
    Micro-op fusion effectively widens the pipeline.
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode / Micro-Fusion (cont.)
    u-ops of a store, “movps [EAX+240], xmm0”:
      sta eax+240 (store address)
      std xmm0, [eax+240] (store data)
    Fused into a single u-op:
      st xmm0, [eax+240]
  • Intel® Core™ Microarchitecture – Front End: Instruction Decode / Stack Pointer Tracker (Extended Stack Pointer folding)
    ESP is calculated by dedicated logic
    • No explicit micro-ops updating ESP
    • Micro-op savings
    • Power savings
    Example sequence: PUSH EAX, PUSH EDX, POP EBX
    (Diagram: decoders 0 through N, the tracked ESP delta (ESPd = 8 in the example), and recovery information.)
  • Core® Micro-architecture Front End
    • Instruction Fetch Unit
    • Instruction Queue
    • Instruction Decode Unit
    • Branch Prediction Unit
  • Intel® Core™ Microarchitecture – Front End: Branch Prediction Unit
    Allows executing instructions long before the branch outcome is decided
    • Superset of Prescott / Pentium-M features
    • One taken branch every other clock
    • Branch predictions for 32 bytes at a time, twice the width of the fetch engine
    (Diagram: icache, predecode, branch prediction unit, instruction queue, and instruction decode, within the overall block diagram: Instruction Fetch Unit, IQ/Decode, MS, and BTBs/Branch Prediction in the Front End; Renamer/Allocator, Scheduler, Execution Unit, and Buffers (Retirement) in the Execution Core; 1st Level Cache (Data) and 2nd Level Cache.)
  • Intel® Core™ Microarchitecture – Front End: Branch Prediction Unit (cont.)
    16-entry Return Stack Buffer (RSB)
    Front-end queuing of BPU lookups
    Types of predictions
    • Direct calls and jumps
    • Indirect calls and jumps
    • Conditional branches
  • Intel® Core™ Microarchitecture – Front End: Branch Prediction Improvements
    Intel® Pentium® 4 processor branch prediction plus the following two improvements:
    • Indirect Branch Predictor
    • Loop Detector
    Branch mispredictions reduced by >20%
  • Agenda
    Introduction
    Knowledge preparation
    Notable features
    Micro-architecture drill-down
    • Front End
    • Out-Of-Order Execution Core
    • Memory Sub-system
    Coding considerations
  • Core® Micro-architecture Execution Core
    Accepts decoded u-ops, assigns resources, executes and retires u-ops
    • Renamer
    • Reservation station (RS)
    • Issue ports
    • Execution units
    (Diagram: register alias table, ALLOC, Reservation Station, and Re-Order Buffer feeding the integer / FP / SIMD (3x), load, store address, and store data ports, within the overall block diagram of Front End, Execution Core, and 1st and 2nd Level Caches.)
  • Intel® Core™ Microarchitecture – Execution Core: Execution Core Building Blocks
    Renamer, RS, ROB
    Ports (number):
    • 0, 1, 5: execution units (Integer, SIMD Integer, SIMD/Integer MUL, Floating Point)
    • 2: Load
    • 3, 4: Store
    Memory Sub-system
  • Intel® Core™ Microarchitecture – Execution Core: Rename and Resources
    4 uops renamed / retired per clock
    • One taken branch, any number of untaken
    • One fxchg per cycle
    Uops written to RS and ROB
    • Decoded uops are renamed and allocated resources by the RAT, then sent to ROB read and the RS
    • The RS waits for sources to arrive, allowing OOO execution
    • Registers not “in flight” are read from the ROB during RS write
    (Diagram: register alias table, ALLOC, Reservation Station, Re-Order Buffer.)
  • Intel® Core™ Microarchitecture – Execution Core: Issue Ports and Execution Units
    6 dispatch ports from RS
    • 3 execution ports (shared for integer / FP / SIMD)
    • Load
    • Store (address)
    • Store (data)
    128-bit SSE implementation
    • Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)
    • Port 1 has packed add (3 cycles, all precisions)
    FP data has one additional cycle of bypass latency
    • Do not mix SSE FP and SSE integer ops on the same register
    Avoid:
      addps xmm0, xmm1
      pand xmm0, xmm3
      addps xmm2, xmm0
    Better:
      addps xmm0, xmm1
      addps xmm2, xmm0
      pand xmm0, xmm3
  • Intel® Core™ Microarchitecture – Execution Core: The Out Of Order
    Each uop takes only a single RS entry
    • load + add dispatches twice (load, then add)
    • mulps dispatches once, when load + add writes back
    • sta + std dispatches twice
      • sta (address) can fire as early as possible
      • std must wait for mulps to write back
    • cmpjne dispatches only once (functionality is truly fused)
      • No dependency, can fire as early as it wants
    RS contents:
      cmpjne EAX, 100000, label
      sta_std [EAX+240], xmm0
      mulps xmm0, xmm0, xmm0
      load_add xmm0, xmm0, [EAX+16]
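The bookkeeping on this slide (one RS entry per fused uop, but possibly two dispatches) can be tallied with a small table. This is only an illustrative sketch: the struct and function names are hypothetical, the entries are listed in program order, and the dispatch counts are copied from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* One reservation-station entry per (possibly fused) uop;
   `dispatches` is how many times that entry issues to a port. */
struct rs_entry {
    const char *uop;
    unsigned int dispatches;
};

/* Loop body from the slide, in program order. */
static const struct rs_entry loop_body[] = {
    { "load_add xmm0, xmm0, [EAX+16]", 2 },  /* load, then add */
    { "mulps xmm0, xmm0, xmm0",        1 },
    { "sta_std [EAX+240], xmm0",       2 },  /* store address, store data */
    { "cmpjne EAX, 100000, label",     1 },  /* truly fused */
};

/* Sums the dispatch counts over `count` RS entries. */
unsigned int total_dispatches(const struct rs_entry *entries, size_t count)
{
    assert(entries != NULL || count == 0);
    unsigned int total = 0;
    for (size_t i = 0; i < count; i++) {
        total += entries[i].dispatches;
    }
    return total;
}
```

For this loop body, five instructions occupy four RS entries and produce six dispatches.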
  • Intel® Core™ Microarchitecture – Execution Core: Dispatching to OOO EXE
    Dispatch ports:
    • 0: GP (incl. FP mul)
    • 1: GP (incl. FP add)
    • 2: Load
    • 3: STA
    • 4: STD
    • 5: GP (incl. jmp)
    RS contents (four loop iterations; the store address is [EAX+240], [EAX+244], [EAX+248], and [EAX+24C] in turn):
      cmpjne EAX, 100000, label
      sta_std [EAX+240], xmm0
      mulps xmm0, xmm0, xmm0
      load_add xmm0, xmm0, [EAX+16]
  • Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access
    L1D: 3 clk latency, 1 clk throughput; L2: 14 clk latency, 2 clk throughput
    Miss latencies
    • L1 miss that hits L2: ~10 cycles
    • L2 miss, access to memory: ~300 cycles (server/FBD)
    • L2 miss, access to memory: ~165 cycles (desktop/DDR2)
    • C-step Broadwater is reported to have ~50 ns latency
    Cache bandwidth
    • Bandwidth to cache: ~8.5 bytes/cycle
    Memory bandwidth
    • Desktop: ~6 GB/sec/socket (Linux)
    • Server: ~3.5 GB/sec/socket
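As a rough worked example of what these latencies mean for a load mix, the sketch below adds up cycles for a given number of L1 hits, L2 hits, and memory accesses. The latency constants come from the slide (using the desktop/DDR2 figure); the hit counts are made-up inputs, the function name is hypothetical, and the model assumes latencies simply add with no overlap from out-of-order execution or prefetching.

```c
#include <assert.h>

enum {
    L1D_LATENCY = 3,   /* cycles, from the slide */
    L2_LATENCY  = 14,  /* cycles, from the slide */
    MEM_LATENCY = 165  /* cycles, desktop/DDR2 figure from the slide */
};

/* Total load latency in cycles for the given number of loads served by
   L1, L2, and memory. Assumes no overlap between loads (a pessimistic,
   purely additive model). */
unsigned long total_load_cycles(unsigned long l1_hits,
                                unsigned long l2_hits,
                                unsigned long mem_accesses)
{
    return l1_hits * L1D_LATENCY
         + l2_hits * L2_LATENCY
         + mem_accesses * MEM_LATENCY;
}
```

With an assumed mix of 950 L1 hits, 40 L2 hits, and 10 memory accesses per 1000 loads, the total is 5060 cycles, or about 5.06 cycles per load; the 1% of loads that go to memory account for roughly a third of that.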
  • Optimizing for Intel® Core™ Microarchitecture
    Use CMP (chip multiprocessing): employ both cores
    • Go multithreaded!
    Prefer SSE as much as possible. If you have not done so yet, vectorize the code now!
    • The Intel Compiler has a very good vectorization engine
    Align data and lay it out sequentially
    • To align, use __declspec(align(16)) float a[1000];
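The __declspec(align(16)) form on the slide is specific to the Microsoft and Intel compilers on Windows. A portable sketch of the same 16-byte alignment in standard C11 follows; the helper name is_aligned_16 is hypothetical and only there to check the result.

```c
#include <stdalign.h>
#include <stdint.h>

/* C11 spelling of a 16-byte-aligned array; the slide uses the
   compiler-specific form __declspec(align(16)) float a[1000]; */
static alignas(16) float a[1000];

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

With the array aligned this way, every fourth float starts a 16-byte block, which is what aligned 128-bit SSE loads and stores expect.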
  • Optimizing for Intel® Core™ Microarchitecture (advanced)
    Use Intel VTune™ Performance Analyzer to reveal performance problems
    • CPI
    • Specific CPU events for the Core architecture: RESOURCE_STALLS.RS_FULL, L2_IFETCH.SELF.MESI, RESOURCE_STALLS.ROB_FULL, etc.; see VTune help
  • Front End Issue Debugging
    Look for Front End (FE) optimization only when code is FE bound
    • The reservation station (RS) is the target of the front end and allocation
    • Low RESOURCE_STALLS.RS_FULL together with poor CPI should be debugged as a front end issue
      • If there are no issues in the FE, the RS should be full more than 30% of the time
    Typical Front End issues:
    • Code is too big to fit in the L1
      • When L2_IFETCH.SELF.MESI happens every 10-15 instructions
      • Code that could have had a CPI of 1 will be around 2
      • 14-cycle penalty for an L1 demand miss
    • Average instruction size above 6 bytes
      • Happens typically with SSE code, and more with EM64T
      • Can have an impact only in the case of otherwise excellent CPI
    • Code with length-changing prefix (LCP) issues
      • Penalty of 6 cycles or more
      • Look at the ILD_STALL VTune event
    The Front End should not be the bottleneck. Focus on Front End issues only if it is the issue.
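The rule of thumb above (poor CPI combined with the RS being full less than 30% of the time) can be written down as a small check. This is a hypothetical helper, not a VTune API: it assumes the raw counter values have already been collected, and the CPI threshold that counts as "poor" is left as a caller-supplied assumption because the slide does not give one.

```c
#include <assert.h>

/* Applies the slide's heuristic to raw counter values.
   cycles          : unhalted core cycles in the measured interval
   instructions    : instructions retired in the same interval
   rs_full_cycles  : RESOURCE_STALLS.RS_FULL count for the interval
   poor_cpi        : CPI above which the code is considered slow (assumption)
   Returns 1 if the interval looks front-end bound, 0 otherwise. */
int looks_front_end_bound(unsigned long long cycles,
                          unsigned long long instructions,
                          unsigned long long rs_full_cycles,
                          double poor_cpi)
{
    assert(cycles > 0 && instructions > 0);
    double cpi = (double)cycles / (double)instructions;
    double rs_full_fraction = (double)rs_full_cycles / (double)cycles;
    /* The 30% figure is the one quoted on the slide. */
    return cpi > poor_cpi && rs_full_fraction < 0.30;
}
```

For example, 2000 cycles for 1000 instructions (CPI 2) with the RS full for only 100 cycles would be flagged, while the same CPI with the RS full for 1000 cycles would point to the execution side instead.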
  • Execution micro architecture
    The busiest port may determine the potential execution speed
    Single-clock-latency operations are best
    • Operations with different latencies can create writeback conflicts, creating a bubble in the port
    Look at the dependency chains to see the potential parallelism
    • Remember that the RS has only 32 entries, and only those instructions are candidates for scheduling to the execution ports
      • High RESOURCE_STALLS.RS_FULL percentage if the code is latency bound
    • The ROB has 96 entries
      • High RESOURCE_STALLS.ROB_FULL percentage only if the code has long-latency instructions (L2 misses)
      • Other code can be executed while waiting
    Execution stage: the key for good performance. Focus on port utilization and dependency chains.
  • Execution micro architecture
    The divider is a big potential stall source
    • DIV counts the number of divide operations executed
    • IDLE_DURING_DIV counts the number of cycles with no port issue while the divider is busy
      • Try to find some useful work to do in parallel with divide operations
    Extra cycle of latency for bypass between execution domains
    • For example: FP (ADDPS) and logical ops (PAND) on XMMn
    • DELAYED_BYPASS.FP
    • DELAYED_BYPASS.LOAD
    • DELAYED_BYPASS.SIMD
    (Diagram: ports 0, 1, 5 with Integer, SIMD Integer, SIMD MUL, and Floating Point units; load on port 2, store (address) on port 3, store (data) on port 4; Data Cache Unit with DTLB, memory ordering, and store forwarding.)
  • Enhancements and Optimization Opportunities
    IP Prefetcher
    • Prefetches strided loads associated with the same IP
    • Uses a history table
    • Use VTune events to identify misses where prefetches were expected
    Memory Disambiguation
    • Predicts when it is OK to fire a load before preceding stores with unknown addresses
      • A misprediction triggers a pipeline flush and load restart
      • Disambiguation is temporarily disabled if it fails frequently
    • LOAD_BLOCK.STA counts loads blocked by a preceding store with unknown address
      • If the load and store are not to the same address, a possible reason for disambiguation not working: address collision with other load(s)
  • Other Opportunities for Performance Gain in the memory sub-system
    4K Aliasing
    • The OOO engine can fire a load before a preceding store if it does not collide with the store's address
      • An address collision serializes execution
    • Address checking uses only the last 12 bits (4K)
      • False blocking occurs if the load's and store's addresses are a multiple of 4 KB apart
      • e.g. accessing large, power-of-two-sized arrays in a loop
    • Resolve 4K aliasing conflicts by changing the memory layout
      • VTune event: LOAD_BLOCK.OVERLAP_STORE
    Load block cases
    • Increase the distance between the store and the dependent load, so that the store data/address is known at the time the load is dispatched
      • Store address unknown: LOAD_BLOCK.STA (loads blocked by a preceding store with unknown address)
      • Store data unknown: LOAD_BLOCK.STD (loads blocked by a preceding store with unknown data)
    • Loads blocked until retirement: LOAD_BLOCK.UNTIL_RETIRE
      • This includes mainly uncacheable loads and split loads (loads that cross a cache line boundary)
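Two of the conditions on this slide are easy to test for in software when auditing a data layout. The sketch below is a simplified illustration with hypothetical function names: the alias check only compares the low 12 bits of two addresses (it ignores access sizes), and the split check assumes a 64-byte cache line, which is not stated on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumption: 64-byte cache lines */

/* Returns 1 if two different addresses share their low 12 bits, the
   false-blocking case described as 4K aliasing; 0 otherwise. */
int is_4k_alias(uintptr_t load_addr, uintptr_t store_addr)
{
    return load_addr != store_addr
        && ((load_addr ^ store_addr) & 0xFFFu) == 0;
}

/* Returns 1 if an access of `size` bytes starting at `addr` crosses a
   cache line boundary (a split access); 0 otherwise. */
int is_split_access(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return (addr / CACHE_LINE_BYTES) != ((addr + size - 1) / CACHE_LINE_BYTES);
}
```

For instance, two arrays whose base addresses are 0x1000 and 0x2000 alias at every matching index, so padding one of them by a cache line or so removes the conflict; a 16-byte access at offset 60 within a line is split, while the same access at offset 64 is not.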