  • Talk about the different attributes and why they matter to different segments…
  • The “uncore” provides segment differentiation. Separate voltage/frequency domains decouple the core and uncore designs, which is how we build something modular. Core vs. uncore: the core is the adders, multipliers, and so on, while the uncore is everything that glues the cores together, and the two can be designed separately. The core is common across all the different solutions – the same microarchitecture across all segments, which is good for software optimization, with a common feature set. The uncore is where we can differentiate: number of cores, size of the LLC (as we change cores, we change cache), type of memory (native vs. buffered), number of sockets, and integrated graphics available on some parts but not all. These are the different knobs we can play with. Q: Is the common core the same core as Silverthorne/Menlow? A: No – it is a very different core.
  • Intel, in its resolve to continue its ISA leadership, is introducing during 2008 a new set of instructions collectively called SSE4.2. SSE4 has been defined for Intel’s 45nm products; the first version, SSE4.1, was introduced in the Penryn-based products. A set of 7 new instructions, collectively called SSE4.2, is being introduced in the Nehalem architecture in 2008. SSE4.2 is further divided into 2 distinct subgroups, STTNI and ATA. There are 4 instructions in STTNI and 2 ATAs, namely POPCNT and CRC32. STTNI is meant for accelerated string and text processing. POPCNT is an ATA for fast pattern recognition while processing large data sets. CRC32 is another ATA that provides a hardware-based CRC instruction, allowing for new communication capabilities.
  • We will surely be asked some “why”s here…
  • Integration, in combination with the flexibility of the PCU, allows aggressive power reduction of components that were not well power-managed in the past.
  • The voltage needed for a given frequency changes with the number of active cores and with temperature.
  • Lots of cores in small form factors. Retain the frequency capability of the silicon – better upside in more power-constrained form factors: more cores, lower power budget.
  • You might want to use the term "page-table shadowing" to describe what a VMM needs to do in software, and describe a bit more what that means – namely that it involves the VMM making a shadow copy of the guest page tables, which requires constant maintenance by the VMM (e.g., the VMM must intercept all guest accesses to CR3 and executions of INVLPG, and often needs to handle "induced #PFs" that arise from changes a guest OS makes to its own page tables, which must then be reflected in the shadow copy). You don't necessarily need to add all of this to the foil – you might just speak to these points – but the nature of the work that software needs to do isn't quite coming through, so a bit more text would help. Spending a bit more time here will also set you up for your final foil, where you show how EPT eliminates these overheads of VM exits on CR3 accesses, INVLPG, etc.
  • Equal Any: does a logical OR down each column. Ranges: the first compare does GE, the next does LE; it then performs a logical AND of the GE/LE pairs of results and finally ORs those results to generate the final result. Equal Ordered: does a logical AND of the results along each diagonal of the matrix. Equal Each: checks each bit on the main diagonal of the matrix. Applications benefiting from STTNI: parsing/tokenizing, GZip, virus/spam/intrusion detection, string processing (string.h, java.lang.String, System.String – no known benchmarks), Flex/Bison, compiler and state-machine tools, DB diffing, counted string compare, RegEx.
  • Intel® Integrated Performance Primitives (Intel® IPP) is an extensive library of multi-core-ready, highly optimized software functions for multimedia data processing and communications applications. If a function is an intrinsic, the code for that function is usually inserted inline, avoiding the overhead of a function call and allowing highly efficient machine instructions to be emitted for that function. An intrinsic is often faster than the equivalent inline assembly, because the optimizer has built-in knowledge of how intrinsics behave, so some optimizations are available that are not available when inline assembly is used. A header file that declares prototypes for the intrinsic functions needs to be included.
  • Imm8 is the control byte whose bit fields control the following attributes. Source data format: byte/word granularity, signed or unsigned, etc. Aggregation operation: encodes the mode of per-element comparison and how the per-element comparisons are aggregated into an intermediate result. Polarity: indicates any further processing that needs to be performed on the intermediate result. Output selection: depending on Index or Mask, specifies the final operation that produces the output from the intermediate result.
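    To make the control-byte fields above concrete, here is a small hedged sketch using the SSE4.2 intrinsics from <nmmintrin.h>; the _SIDD_* flag macros and _mm_cmpistri are the standard intrinsic names, while the 16-byte buffers and the choice of "equal ordered" mode are illustrative assumptions.

      /* Sketch: composing the imm8 control byte with the SSE4.2 intrinsic flag
       * macros and using PCMPISTRI (via _mm_cmpistri) for a substring search.
       * Build with SSE4.2 code generation enabled (e.g. -msse4.2 on GCC). */
      #include <nmmintrin.h>
      #include <stdio.h>

      int main(void)
      {
          char needle16[16]   = "day";        /* Src1: the substring to find   */
          char haystack16[16] = "Test day";   /* Src2: the string searched     */
          __m128i needle   = _mm_loadu_si128((const __m128i *)needle16);
          __m128i haystack = _mm_loadu_si128((const __m128i *)haystack16);

          /* imm8 fields: unsigned bytes (source data format), "equal ordered"
           * (aggregation operation), positive polarity, and the index of the
           * least significant match as the output selection. */
          int idx = _mm_cmpistri(needle, haystack,
                                 _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED |
                                 _SIDD_POSITIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT);

          printf("substring starts at byte %d\n", idx);  /* prints 5; 16 = not found */
          return 0;
      }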
  • Leakage and global clock distribution power are the cost of playing the high performance processor game.
  • The OEM adheres to thermal design guidelines. Works with Enhanced Intel SpeedStep® Technology to increase energy efficiency. The capability extends further in Sandy Bridge.

Presentation Transcript

  • 1. A Look Inside Intel®: The Core (Nehalem) Microarchitecture. Beeman Strong, Intel® Core™ microarchitecture (Nehalem) Architect, Intel Corporation
  • 2. Agenda
    • Intel ® Core™ Microarchitecture (Nehalem) Design Overview
    • Enhanced Processor Core
      • Performance Features
      • Intel® Hyper-Threading Technology
    • New Platform
      • New Cache Hierarchy
      • New Platform Architecture
    • Performance Acceleration
      • Virtualization
      • New Instructions
    • Power Management Overview
      • Minimizing Idle Power Consumption
      • Performance when it counts
  • 3. Scalable Cores: common feature set, same core for all segments, common software optimization, 45nm. Servers/Workstations: Energy Efficiency, Performance, Virtualization, Reliability, Capacity, Scalability. Desktop: Performance, Graphics, Energy Efficiency, Idle Power, Security. Mobile: Battery Life, Performance, Energy Efficiency, Graphics, Security. Intel® Core™ Microarchitecture (Nehalem): optimized cores to meet all market segments.
  • 4. The First Intel® Core™ Microarchitecture (Nehalem) Processor: A Modular Design for Flexibility. [Die diagram: four cores, a queue, shared L3 cache, memory controller, two Intel® QuickPath Interconnect (Intel® QPI) links (QPI 0 and QPI 1), and miscellaneous I/O.]
  • 5. Agenda
    • Intel ® Core™ Microarchitecture (Nehalem) Design Overview
    • Enhanced Processor Core
      • Performance Features
      • Intel® Hyper-Threading Technology
    • New Platform
      • New Cache Hierarchy
      • New Platform Architecture
    • Performance Acceleration
      • Virtualization
      • New Instructions
    • Power Management Overview
      • Minimizing Idle Power Consumption
      • Performance when it counts
  • 6. Intel ® Core™ Microarchitecture Recap
    • Wide Dynamic Execution
      • 4-wide decode/rename/retire
    • Advanced Digital Media Boost
      • 128-bit wide SSE execution units
    • Intel HD Boost
      • New SSE4.1 Instructions
    • Smart Memory Access
      • Memory Disambiguation
      • Hardware Prefetching
    • Advanced Smart Cache
      • Low latency, high BW shared L2 cache
    Nehalem builds on the great Core microarchitecture
  • 7. Designed for Performance. [Core block diagram: Execution Units, Out-of-Order Scheduling & Retirement, L2 Cache & Interrupt Servicing, Instruction Fetch & L1 Cache, Branch Prediction, Instruction Decode & Microcode, Paging, L1 Data Cache, Memory Ordering & Execution.] Highlighted enhancements: Additional Caching Hierarchy, New SSE4.2 Instructions, Deeper Buffers, Faster Virtualization, Simultaneous Multi-Threading, Better Branch Prediction, Improved Lock Support, Improved Loop Streaming.
  • 8. Macrofusion
    • Introduced in Intel ® Core™2 microarchitecture
    • TEST/CMP instruction followed by a conditional branch treated as a single instruction
      • Decode/execute/retire as one instruction
    • Higher performance & improved power efficiency
      • Improves throughput/Reduces execution latency
      • Less processing required to accomplish the same work
    • Supports all the cases in Intel Core 2 microarchitecture, PLUS:
      • CMP+Jcc macrofusion added for the following branch conditions
        • JL/JNGE
        • JGE/JNL
        • JLE/JNG
        • JG/JNLE
      • Intel ® Core™ microarchitecture (Nehalem) supports macrofusion in both 32-bit and 64-bit modes
        • Intel Core2 microarchitecture only supports macrofusion in 32-bit mode
    Increased macrofusion benefit on Intel® Core™ microarchitecture (Nehalem)
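    As an illustration of the CMP+Jcc pattern this slide describes, here is a minimal C sketch; the exact instructions a compiler emits vary, so the registers and branch conditions named in the comments are assumptions, not a claim about specific compiler output.

      /* A minimal C loop whose tests a compiler typically compiles to a CMP
       * immediately followed by a conditional branch (e.g. cmp/jl or cmp/jb),
       * the pair that macrofusion treats as one instruction. */
      #include <stddef.h>

      long sum_below(const long *a, size_t n, long limit)
      {
          long total = 0;
          for (size_t i = 0; i < n; i++) {   /* loop back-edge: CMP + Jcc        */
              if (a[i] < limit)              /* another CMP + Jcc (JL/JNGE) case */
                  total += a[i];
          }
          return total;
      }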
  • 9. Intel ® Core™ Microarchitecture (Nehalem) Loop Stream Detector
    • Loop Stream Detector identifies software loops
      • Stream from Loop Stream Detector instead of normal path
      • Disable unneeded blocks of logic for power savings
      • Higher performance by removing instruction fetch limitations
    • Higher performance : Expand the size of the loops detected (vs Core 2)
    • Improved power efficiency : Disable even more logic (vs Core 2)
    [Diagram: Intel® Core™ microarchitecture (Nehalem) Loop Stream Detector – Branch Prediction and Fetch feed Decode, with the Loop Stream Detector after Decode holding 28 micro-ops.]
  • 10. Branch Prediction Improvements
    • Focus on improving branch prediction accuracy each CPU generation
      • Higher performance & lower power through more accurate prediction
    • Example Intel ® Core™ microarchitecture (Nehalem) improvements
      • L2 Branch Predictor
        • Improve accuracy for applications with large code size (ex. database applications)
      • Advanced Renamed Return Stack Buffer (RSB)
        • Remove branch mispredicts on x86 RET instruction (function returns) in the common case
    Greater Performance through Branch Prediction
  • 11. Execution Unit Overview
    • Execute 6 operations/cycle
    • 3 Memory Operations
      • 1 Load
      • 1 Store Address
      • 1 Store Data
    • 3 “Computational” Operations
    [Diagram: Unified Reservation Station feeding Ports 0-5. Port 2: Load; Port 3: Store Address; Port 4: Store Data; Ports 0, 1, and 5: computational units including Integer ALU & Shift, Integer ALU & LEA, Branch, FP Add, FP Multiply, Complex Integer, Divide, SSE Integer ALU, SSE Integer Multiply, Integer Shuffles, FP Shuffle.]
    • Unified Reservation Station
    • Schedules operations to Execution units
    • Single Scheduler for all Execution Units
    • Can be used by all integer, all FP, etc.
  • 12. Increased Parallelism
    • Goal: Keep powerful execution engine fed
    • Nehalem increases size of out of order window by 33%
    • Must also increase other corresponding structures
    Increased Resources for Higher Performance.1 Structure (Intel® Core™ microarchitecture (formerly Merom) -> Intel® Core™ microarchitecture (Nehalem)): Reservation Station, 32 -> 36 (dispatches operations to execution units); Load Buffers, 32 -> 48 (tracks all load operations allocated); Store Buffers, 20 -> 32 (tracks all store operations allocated). 1 Intel® Pentium® M processor (formerly Dothan), Intel® Core™ microarchitecture (formerly Merom), Intel® Core™ microarchitecture (Nehalem).
  • 13. Enhanced Memory Subsystem
    • Responsible for:
      • Handling of memory operations (loads/stores)
    • Key Intel ® Core™2 Features
      • Memory Disambiguation
      • Hardware Prefetchers
      • Advanced Smart Cache
    • New Intel ® Core™ Microarchitecture (Nehalem) Features
      • New TLB Hierarchy (new, low latency 2 nd level unified TLB)
      • Fast 16-Byte unaligned accesses
      • Faster Synchronization Primitives
  • 14. Intel® Hyper-Threading Technology
    • Also known as Simultaneous Multi-Threading (SMT)
      • Run 2 threads at the same time per core
    • Take advantage of 4-wide execution engine
      • Keep it fed with multiple threads
      • Hide latency of a single thread
    • Most power efficient performance feature
      • Very low die area cost
      • Can provide significant performance benefit depending on application
      • Much more efficient than adding an entire core
    • Intel ® Core™ microarchitecture (Nehalem) advantages
      • Larger caches
      • Massive memory BW
    Simultaneous multi-threading enhances performance and energy efficiency. [Chart: execution-unit occupancy over time (processor cycles) without SMT vs. with SMT; each box represents a processor execution unit.]
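    A hedged sketch of how software exercises the two hardware threads of one core: two workers pinned to a pair of logical CPUs with pthread_setaffinity_np (a GNU extension). Which logical CPU IDs are SMT siblings is system-specific, so the IDs 0 and 4 below are only an assumption (check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on Linux).

      /* Two independent workers pinned to assumed SMT-sibling logical CPUs. */
      #define _GNU_SOURCE
      #include <pthread.h>
      #include <sched.h>
      #include <stdio.h>

      static void *worker(void *arg)
      {
          long id = (long)arg;
          double x = 0.0;
          for (long i = 1; i < 100000000L; i++)   /* independent FP work */
              x += 1.0 / (double)i;
          printf("thread %ld done, x=%f\n", id, x);
          return NULL;
      }

      static void pin(pthread_t t, int cpu)
      {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(cpu, &set);
          pthread_setaffinity_np(t, sizeof(set), &set);
      }

      int main(void)
      {
          pthread_t a, b;
          pthread_create(&a, NULL, worker, (void *)0L);
          pthread_create(&b, NULL, worker, (void *)1L);
          pin(a, 0);   /* assumed SMT siblings of one physical core */
          pin(b, 4);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          return 0;
      }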
  • 15. SMT Performance Chart Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/ SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation. For more information on SPEC benchmarks, see: http:// www.spec.org Floating Point is based on SPECfp_rate_base2006* estimate Integer is based on SPECint_rate_base2006* estimate
  • 16. Agenda
    • Intel ® Core™ Microarchitecture (Nehalem) Design Overview
    • Enhanced Processor Core
      • Performance Features
      • Intel® Hyper-Threading Technology
    • New Platform
      • New Cache Hierarchy
      • New Platform Architecture
    • Performance Acceleration
      • Virtualization
      • New Instructions
    • Power Management Overview
      • Minimizing Idle Power Consumption
      • Performance when it counts
  • 17. Designed for Modularity (2008–2009 Servers & Desktops). [Diagram: cores plus an uncore containing the L3 cache, integrated memory controller (IMC) to DRAM, Intel® QuickPath Interconnect (Intel® QPI) links, and power & clock logic.] Differentiation in the “uncore”: # of QPI links, # of memory channels, size of cache, # of cores, power management, type of memory, integrated graphics. Optimal price / performance / energy efficiency for server, desktop and mobile products.
  • 18. Intel® Smart Cache – 3rd Level Cache
    • Shared across all cores
    • Size depends on # of cores
      • Quad-core: Up to 8MB (16-way)
      • Scalability:
        • Built to vary size with varied core counts
        • Built to easily increase L3 size in future parts
    • Perceived latency depends on frequency ratio between core & uncore
    • Inclusive cache policy for best performance
      • Address residing in L1/L2 must be present in 3rd level cache
    [Diagram: each core with its own L1 caches and L2 cache, all sharing the L3 cache.]
  • 19. Why Inclusive?
    • Inclusive cache provides benefit of an on-die snoop filter
    • Core Valid Bits
      • 1 bit per core per cache line
        • If line may be in a core, set core valid bit
        • Snoop only needed if line is in L3 and core valid bit is set
        • Guaranteed that line is not modified if multiple bits set
    • Scalability
      • Addition of cores/sockets does not increase snoop traffic seen by cores
    • Latency
      • Minimize effective cache latency by eliminating cross-core snoops in the common case
      • Minimize snoop response time for cross-socket cases
  • 20. Intel ® Core™ Microarchitecture (Nehalem-EP) Platform Architecture
    • Integrated Memory Controller
      • 3 DDR3 channels per socket
      • Massive memory bandwidth
      • Memory Bandwidth scales with # of processors
      • Very low memory latency
    • Intel® QuickPath Interconnect (Intel® QPI)
      • New point-to-point interconnect
      • Socket to socket connections
      • Socket to chipset connections
      • Build scalable solutions
      • Up to 6.4 GT/sec (12.8 GB/sec)
      • Bidirectional (=> 25.6 GB/sec)
    Significant performance leap from the new platform. [Diagram: two-socket platform – Intel® Core™ microarchitecture (Nehalem-EP) CPUs, each with its own memory, connected to each other and to the Intel® Next Generation Server Processor Technology (Tylersburg-EP) IOH.]
  • 21. Non-Uniform Memory Access (NUMA)
    • FSB architecture
      • All memory in one location
    • Starting with Intel ® Core™ microarchitecture (Nehalem)
      • Memory located in multiple places
    • Latency to memory dependent on location
    • Local memory has highest BW, lowest latency
    • Remote Memory still very fast
    Ensure software is NUMA-optimized for best performance. [Chart: relative memory latency comparison – Harpertown (FSB 1600), Nehalem (DDR3-1067) local, Nehalem (DDR3-1067) remote.]
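    As a concrete illustration of NUMA-aware allocation, here is a hedged sketch using libnuma (link with -lnuma); the calls shown (numa_available, numa_node_of_cpu, numa_alloc_onnode, numa_free) are the libnuma APIs as I understand them, and the buffer size is arbitrary.

      /* Allocate a buffer on the memory node local to the calling thread. */
      #define _GNU_SOURCE
      #include <numa.h>
      #include <sched.h>
      #include <stdio.h>

      int main(void)
      {
          if (numa_available() < 0) {
              fprintf(stderr, "no NUMA support on this system\n");
              return 1;
          }
          int cpu  = sched_getcpu();            /* CPU this thread runs on */
          int node = numa_node_of_cpu(cpu);     /* its local memory node   */

          size_t bytes = 64 * 1024 * 1024;
          char *buf = numa_alloc_onnode(bytes, node);   /* local-node allocation */
          if (!buf)
              return 1;
          for (size_t i = 0; i < bytes; i += 4096)      /* touch pages so they   */
              buf[i] = 0;                               /* are placed locally    */

          printf("allocated %zu bytes on node %d (cpu %d)\n", bytes, node, cpu);
          numa_free(buf, bytes);
          return 0;
      }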
  • 22. Memory Bandwidth – Initial Intel ® Core™ Microarchitecture (Nehalem) Products
    • 3 memory channels per socket
    • ≥ DDR3-1066 at launch
    • Massive memory BW
    • Scalability
      • Design IMC and core to take advantage of BW
      • Allow performance to scale with cores
        • Core enhancements
          • Support more cache misses per core
          • Aggressive hardware prefetching w/ throttling enhancements
        • Example IMC Features
          • Independent memory channels
          • Aggressive Request Reordering
    Massive memory BW provides performance and scalability. [Chart: Stream bandwidth, MBytes/sec (Triad) – NHM (Intel® Core™ microarchitecture (Nehalem)) delivers about 3.4x the bandwidth of HTN (Intel® Xeon® processor 5400 Series (Harpertown)). Source: Intel internal measurements, August 2008.]
  • 23. Agenda
    • Intel ® Core™ Microarchitecture (Nehalem) Design Overview
    • Enhanced Processor Core
      • Performance Features
      • Intel® Hyper-Threading Technology
    • New Platform
      • New Cache Hierarchy
      • New Platform Architecture
    • Performance Acceleration
      • Virtualization
      • New Instructions
    • Power Management Overview
      • Minimizing Idle Power Consumption
      • Performance when it counts
  • 24. Virtualization
    • To get best virtualized performance
      • Have best native performance
      • Reduce transitions to/from virtual machine
      • Reduce latency of transitions
    • Intel® Core™ microarchitecture (Nehalem) virtualization features
      • Reduced latency for transitions
      • Virtual Processor ID (VPID) to reduce effective cost of transitions
      • Extended Page Table (EPT) to reduce # of transitions
    [Chart: Round Trip Virtualization Latency.]
  • 25. EPT Solution
    • Intel ® 64 Page Tables
      • Map Guest Linear Address to Guest Physical Address
      • Can be read and written by the guest OS
    • New EPT Page Tables under VMM Control
      • Map Guest Physical Address to Host Physical Address
      • Referenced by new EPT base pointer
    • No VM Exits due to Page Faults, INVLPG or CR3 accesses
    [Diagram: a guest linear address is translated through CR3 and the Intel® 64 page tables to a guest physical address, then through the EPT base pointer and the EPT page tables to a host physical address.]
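    Purely as a conceptual sketch of the two-level translation this slide shows (not real hardware or VMM code), the lookup chain can be pictured as two table walks; the flat arrays and page counts below are illustrative assumptions.

      /* Toy model: guest page tables map guest-linear to guest-physical pages;
       * the VMM-owned EPT maps guest-physical to host-physical pages. */
      #include <stdint.h>
      #include <stdio.h>

      #define PAGES 16
      static uint64_t guest_page_table[PAGES]; /* guest-linear page  -> guest-physical page */
      static uint64_t ept[PAGES];              /* guest-physical page -> host-physical page */

      static uint64_t translate(uint64_t guest_linear)
      {
          uint64_t offset  = guest_linear & 0xFFF;
          uint64_t gl_page = (guest_linear >> 12) % PAGES;
          uint64_t gp_page = guest_page_table[gl_page]; /* guest-writable tables    */
          uint64_t hp_page = ept[gp_page % PAGES];      /* VMM-controlled EPT walk  */
          return (hp_page << 12) | offset;
      }

      int main(void)
      {
          guest_page_table[1] = 5;   /* guest maps linear page 1 to guest-physical page 5 */
          ept[5] = 9;                /* VMM maps guest-physical page 5 to host page 9     */
          printf("host physical = 0x%llx\n",
                 (unsigned long long)translate(0x1234));  /* -> 0x9234 */
          return 0;
      }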
  • 26. Extending Performance and Energy Efficiency – Intel® SSE4.2: Instruction Set Architecture (ISA) Leadership in 2008. SSE4 (45nm CPUs) comprises SSE4.1 (Penryn core) and SSE4.2 (Nehalem core); SSE4.2 adds STTNI and the ATA (Application Targeted Accelerators).
    • STTNI – Accelerated string and text processing (e.g. XML acceleration): faster XML parsing, faster search and pattern matching, novel parallel data matching and comparison operations
    • POPCNT (ATA) – Accelerated searching & pattern recognition of large data sets (e.g. genome mining): improved performance for genome mining and handwriting recognition; fast Hamming distance / population count
    • CRC32 (ATA) – New communications capabilities (e.g. iSCSI application): hardware-based CRC instruction, accelerated network attached storage, improved power efficiency for software iSCSI, RDMA, and SCTP
    What should the application, OS and VMM vendors do? Understand the benefits and take advantage of the new instructions in 2008, and provide us feedback on instructions ISVs would like to see for the next generation of applications.
  • 27. Agenda
    • Intel ® Core™ Microarchitecture (Nehalem) Design Overview
    • Enhanced Processor Core
      • Performance Features
      • Intel® Hyper-Threading Technology
    • New Platform
      • New Cache Hierarchy
      • New Platform Architecture
    • Performance Acceleration
      • Virtualization
      • New Instructions
    • Power Management Overview
      • Minimizing Idle Power Consumption
      • Performance when it counts
  • 28. Intel® Core™ Microarchitecture (Nehalem) Design Goals: a dynamic and design-scalable microarchitecture. World-class performance combined with superior energy efficiency, optimized for existing and emerging apps, single thread and multi-threads, all usages, workstation/server and desktop/mobile. A single, scalable foundation optimized across each segment and power envelope; dynamically scaled performance when needed to maximize energy efficiency.
  • 29. Power Control Unit. [Diagram: the PCU monitors each core's Vcc, frequency, and sensors, with a PLL per core and for the uncore/LLC, driven by BCLK and Vcc.] Integrated proprietary microcontroller. Shifts control from hardware to embedded firmware. Real-time sensors for temperature, current, and power. Flexibility enables sophisticated algorithms, tuned for current operating conditions.
  • 30. Minimizing Idle Power Consumption
    • Operating system notifies CPU when no tasks are ready for execution
      • Execution of MWAIT instruction
    • MWAIT arguments hint at expected idle duration
      • Higher-numbered C-states mean lower power, but also longer exit latency
    • CPU idle states referred to as “C-States”
    [Chart: idle power (W) vs. exit latency (µs) for C-states from C0 through C1 … Cn.]
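    A conceptual, kernel-only sketch of the MWAIT idle path described above. MONITOR/MWAIT are privileged instructions, so this only shows the shape of OS idle code; the 0x20 hint for core C6 and the ECX "break on interrupt" bit follow my reading of the Linux intel_idle driver for Nehalem and should be treated as assumptions.

      /* Not user-space code: illustrates how an OS idle loop arms MONITOR and
       * then requests a deep C-state with MWAIT. */
      #include <pmmintrin.h>   /* _mm_monitor / _mm_mwait intrinsics */

      #define MWAIT_HINT_C6       0x20u  /* assumed EAX hint for core C6           */
      #define MWAIT_ECX_BREAK_INT 0x01u  /* wake on interrupt even if masked (ECX) */

      static volatile unsigned long need_resched_flag;  /* set when work arrives */

      static void idle_until_work(void)
      {
          while (!need_resched_flag) {
              /* Arm the monitor on the flag's cache line, re-check, then MWAIT.
               * The CPU wakes on an interrupt or a write to the monitored line. */
              _mm_monitor((const void *)&need_resched_flag, 0, 0);
              if (need_resched_flag)
                  break;
              _mm_mwait(MWAIT_ECX_BREAK_INT, MWAIT_HINT_C6);
          }
      }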
  • 31. C6 on Intel ® Core™ Microarchitecture (Nehalem)
  • 32. C6 on Intel® Core™ Microarchitecture (Nehalem): Cores 0, 1, 2, and 3 running applications. [Slides 32–38 show a chart of per-core power over time for Cores 0–3.]
  • 33. C6 on Intel® Core™ Microarchitecture (Nehalem): Task completes. No work waiting. OS executes the MWAIT(C6) instruction.
  • 34. C6 on Intel® Core™ Microarchitecture (Nehalem): Execution stops. Core architectural state saved. Core clocks stopped. Cores 0, 1, and 3 continue execution undisturbed.
  • 35. C6 on Intel® Core™ Microarchitecture (Nehalem): Core power gate turned off. Core voltage goes to 0. Cores 0, 1, and 3 continue execution undisturbed.
  • 36. C6 on Intel® Core™ Microarchitecture (Nehalem): Task completes. No work waiting. OS executes the MWAIT(C6) instruction. Core 0 enters C6. Cores 1 and 3 continue execution undisturbed.
  • 37. C6 on Intel® Core™ Microarchitecture (Nehalem): Interrupt for Core 2 arrives. Core 2 returns to C0; execution resumes at the instruction following MWAIT(C6). Cores 1 and 3 continue execution undisturbed.
  • 38. C6 on Intel® Core™ Microarchitecture (Nehalem): Interrupt for Core 0 arrives. The power gate turns on, the core clock turns on, core state is restored, and the core resumes execution at the instruction following MWAIT(C6). Cores 1, 2, and 3 continue execution undisturbed. Core-independent C6 on Intel Core microarchitecture (Nehalem) extends the benefits.
  • 39. Intel ® Core™ Microarchitecture (Nehalem)-based Processor
    • Significant logic outside core
      • Integrated memory controller
      • Large shared cache
      • High speed interconnect
      • Arbitration logic
    [Chart: total CPU power consumption split between the cores (x N: core leakage, core clock distribution, core clocks and logic) and the uncore (uncore leakage, uncore clock distribution, I/O, uncore logic). Die diagram: four cores, queue, shared L3 cache, memory controller, two Intel® QuickPath Interconnect (Intel® QPI) links, miscellaneous I/O.]
  • 40. Intel ® Core™ Microarchitecture (Nehalem) Package C-State Support
    • All cores in C6 state:
      • Core power to ~0
    [Chart: active CPU power – cores (x N: leakage, clock distribution, clocks and logic) plus uncore leakage, uncore clock distribution, I/O, and uncore logic.]
  • 41. Intel ® Core™ Microarchitecture (Nehalem) Package C-State Support
    • All cores in C6 state:
      • Core power to ~0
    • Package to C6 state:
      • Uncore logic stops toggling
    [Chart: remaining active CPU power – uncore leakage, uncore clock distribution, I/O, uncore logic.]
  • 42. Intel ® Core™ Microarchitecture (Nehalem) Package C-State Support
    • All cores in C6 state:
      • Core power to ~0
    • Package to C6 state:
      • Uncore logic stops toggling
      • I/O to lower power state
    [Chart: remaining active CPU power – uncore leakage, uncore clock distribution, I/O.]
  • 43. Intel ® Core™ Microarchitecture (Nehalem) Package C-State Support
    • All cores in C6 state:
      • Core power to ~0
    • Package to C6 state:
      • Uncore logic stops toggling
      • I/O to lower power state
      • Uncore clock grids stopped
    [Chart: remaining active CPU power – uncore leakage, uncore clock distribution, I/O.] Substantial reduction in idle CPU power.
  • 44. Managing Active Power
    • Operating system changes frequency as needed to meet performance needs, minimize power
      • Enhanced Intel SpeedStep® Technology
      • Referred to as processor P-States
    • PCU tunes voltage for given frequency, operating conditions, and silicon characteristics
    PCU automatically optimizes operating voltage
  • 45. Turbo Mode: Key to Scalability Goal
    • Intel ® Core™ microarchitecture (Nehalem) is a scalable architecture
      • High frequency core for performance in less constrained form factors
      • Retain ability to use that frequency in very small form factors
      • Retain ability to use that frequency when running lightly threaded or lower power workloads
    • Turbo utilizes available frequency:
      • Maximizes both single-thread and multi-thread performance in the same part
    Turbo Mode provides performance when you need it.
  • 46. Turbo Mode Enabling
    • Turbo Mode exposed as additional Enhanced Intel SpeedStep® Technology operating point
      • Operating system treats as any other P-state, requesting Turbo Mode when it needs more performance
      • Performance benefit comes from higher operating frequency – no need to enable or tune software
    • Turbo Mode is transparent to system
      • Frequency transitions handled completely in hardware
      • PCU keeps silicon within existing operating limits
      • Systems designed to same specs, with or without Turbo Mode
    Performance benefits with existing applications and operating systems
  • 47. Summary
    • Intel ® Core™ microarchitecture (Nehalem) – The 45nm Tock
    • Designed for
      • Power Efficiency
      • Scalability
      • Performance
    • Key Innovations:
      • Enhanced Processor Core
      • Brand New Platform Architecture
      • Sophisticated Power Management
    High Performance When You Need It Lower Power When You Don’t
  • 48. Q&A
  • 49. Legal Disclaimer
    • INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
    • Intel may make changes to specifications and product descriptions at any time, without notice.
    • All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
    • Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
    • Merom, Penryn, Harpertown, Nehalem, Dothan, Westmere, Sandy Bridge, and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services, and any such use of Intel's internal code names is at the sole risk of the user.
    • Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
    • Intel, Intel Inside, Intel Core, Pentium, Intel SpeedStep Technology, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
    • *Other names and brands may be claimed as the property of others.
    • Copyright © 2008 Intel Corporation.
  • 50. Risk Factors
    • This presentation contains forward-looking statements that involve a number of risks and uncertainties. These statements do not reflect the potential impact of any mergers, acquisitions, divestitures, investments or other similar transactions that may be completed in the future. The information presented is accurate only as of today’s date and will not be updated. In addition to any factors discussed in the presentation, the important factors that could cause actual results to differ materially include the following: Demand could be different from Intel's expectations due to factors including changes in business and economic conditions, including conditions in the credit market that could affect consumer confidence; customer acceptance of Intel’s and competitors’ products; changes in customer order patterns, including order cancellations; and changes in the level of inventory at customers. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of new Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; Intel’s ability to respond quickly to technological developments and to incorporate new features into its products; and the availability of sufficient supply of components from suppliers to meet demand. The gross margin percentage could vary significantly from expectations based on changes in revenue levels; product mix and pricing; capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; excess or obsolete inventory; manufacturing yields; changes in unit costs; impairments of long-lived assets, including manufacturing, assembly/test and intangible assets; and the timing and execution of the manufacturing ramp and associated costs, including start-up costs. Expenses, particularly certain marketing and compensation expenses, vary depending on the level of demand for Intel's products, the level of revenue and profits, and impairments of long-lived assets. Intel is in the midst of a structure and efficiency program that is resulting in several actions that could have an impact on expected expense levels and gross margin. Intel's results could be impacted by adverse economic, social, political and physical/infrastructure conditions in the countries in which Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the report on Form 10-Q for the quarter ended June 28, 2008.
  • 51. Backup Slides
  • 52. Intel® Core™ Microarchitecture (Nehalem) Design Goals: a dynamic and design-scalable microarchitecture. World-class performance combined with superior energy efficiency, optimized for existing and emerging apps, single thread and multi-threads, all usages, workstation/server and desktop/mobile. A single, scalable foundation optimized across each segment and power envelope; dynamically scaled performance when needed to maximize energy efficiency.
  • 53. Tick-Tock Development Model (Forecast): Merom 1 – NEW Microarchitecture, 65nm (TOCK); Penryn – NEW Process, 45nm (TICK); Nehalem – NEW Microarchitecture, 45nm (TOCK); Westmere – NEW Process, 32nm (TICK); Sandy Bridge – NEW Microarchitecture, 32nm (TOCK). All dates, product descriptions, availability and plans are forecasts and subject to change without notice. 1 Intel® Core™ microarchitecture (formerly Merom); 45nm next generation Intel® Core™ microarchitecture (Penryn); Intel® Core™ Microarchitecture (Nehalem); Intel® Microarchitecture (Westmere); Intel® Microarchitecture (Sandy Bridge).
  • 54. Enhanced Processor Core. [Block diagram: Front End – 32kB instruction cache, ITLB, instruction fetch and pre-decode, instruction queue, 4-wide decode; Execution Engine – 4-wide rename/allocate, reservation station dispatching 6 operations/cycle to the execution units, retirement unit (reorder buffer); Memory – 32kB data cache, DTLB, 2nd level TLB, 256kB 2nd level cache, L3 and beyond.]
  • 55. Front-end
    • Responsible for feeding the compute engine
      • Decode instructions
      • Branch Prediction
    • Key Intel ® Core™2 microarchitecture Features
      • 4-wide decode
      • Macrofusion
      • Loop Stream Detector
    [Diagram: 32kB instruction cache and ITLB feeding instruction fetch and pre-decode, the instruction queue, and decode.]
  • 56. Loop Stream Detector Reminder
    • Loops are very common in most software
    • Take advantage of knowledge of loops in HW
      • Decoding the same instructions over and over
      • Making the same branch predictions over and over
    • Loop Stream Detector identifies software loops
      • Stream from Loop Stream Detector instead of normal path
      • Disable unneeded blocks of logic for power savings
      • Higher performance by removing instruction fetch limitations
    [Diagram: Intel® Core™2 Loop Stream Detector – Branch Prediction and Fetch feed the Loop Stream Detector, which holds 18 instructions ahead of Decode.]
  • 57. Branch Prediction Reminder
    • Goal: Keep powerful compute engine fed
    • Options:
      • Stall pipeline while determining branch direction/target
      • Predict branch direction/target and correct if wrong
    • Minimize amount of time wasted correcting from incorrect branch predictions
      • Performance :
        • Through higher branch prediction accuracy
        • Through faster correction when prediction is wrong
      • Power efficiency : Minimize number of speculative/incorrect micro-ops that are executed
    Continued focus on branch prediction improvements
  • 58. L2 Branch Predictor
    • Problem: Software with a large code footprint not able to fit well in existing branch predictors
      • Example: Database applications
    • Solution: Use multi-level branch prediction scheme
    • Benefits:
      • Higher performance through improved branch prediction accuracy
      • Greater power efficiency through less mis-speculation
  • 59. Advanced Renamed Return Stack Buffer (RSB)
    • Instruction Reminder
      • CALL: Entry into functions
      • RET: Return from functions
    • Classical Solution
      • Return Stack Buffer (RSB) used to predict RET
      • RSB can be corrupted by speculative path
    • The Renamed RSB
      • No RET mispredicts in the common case
  • 60. Execution Engine
    • Responsible for:
      • Scheduling operations
      • Executing operations
    • Powerful Intel ® Core™2 microarchitecture execution engine
      • Dynamic 4-wide Execution
      • Intel® Advanced Digital Media Boost
        • 128-bit wide SSE
      • Super Shuffler (45nm next generation Intel ® Core™ microarchitecture (Penryn))
  • 61. Intel® Smart Cache – Core Caches
    • New 3-level Cache Hierarchy
    • 1 st level caches
      • 32kB Instruction cache
      • 32kB, 8-way Data Cache
        • Support more L1 misses in parallel than Intel ® Core™2 microarchitecture
    • 2 nd level Cache
      • New cache introduced in Intel ® Core™ microarchitecture (Nehalem)
      • Unified (holds code and data)
      • 256 kB per core (8-way)
      • Performance : Very low latency
        • 10 cycle load-to-use
      • Scalability : As core count increases, reduce pressure on shared cache
    [Diagram: each core has a 32kB L1 instruction cache, a 32kB L1 data cache, and a 256kB L2 cache.]
  • 62. New TLB Hierarchy
    • Problem: Applications continue to grow in data size
    • Need to increase TLB size to keep pace for performance
    • Nehalem adds a new low-latency unified 2nd level TLB
    # of entries: 1st level instruction TLBs – 128 small page (4k), 7 per thread large page (2M/4M); 1st level data TLBs – 64 small page (4k), 32 large page (2M/4M); new 2nd level unified TLB – 512, small page only
  • 63. Fast Unaligned Cache Accesses
    • Two flavors of 16-byte SSE loads/stores exist
      • Aligned (MOVAPS/D, MOVDQA) -- Must be aligned on a 16-byte boundary
      • Unaligned (MOVUPS/D, MOVDQU) -- No alignment requirement
    • Prior to Intel ® Core™ microarchitecture (Nehalem)
      • Optimized for Aligned instructions
      • Unaligned instructions slower, lower throughput -- Even for aligned accesses!
        • Required multiple uops (not energy efficient)
      • Compilers would largely avoid unaligned load
        • 2-instruction sequence (MOVSD+MOVHPD) was faster
    • Intel Core microarchitecture (Nehalem) optimizes Unaligned instructions
      • Same speed/throughput as Aligned instructions on aligned accesses
      • Optimizations for making accesses that cross 64-byte boundaries fast
        • Lower latency/higher throughput than Core 2
      • Aligned instructions remain fast
    • No reason to use aligned instructions on Intel Core microarchitecture (Nehalem)!
    • Benefits:
      • Compiler can now use unaligned instructions without fear
      • Higher performance on key media algorithms
      • More energy efficient than prior implementations
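    To illustrate the point that unaligned SSE loads can now be used freely, here is a small sketch using the SSE2 intrinsics _mm_loadu_si128/_mm_storeu_si128; the 4-at-a-time summation and the deliberately misaligned starting pointer are illustrative choices.

      /* Sum 32-bit ints four at a time from a pointer with no 16-byte
       * alignment guarantee, using unaligned (MOVDQU-style) loads only. */
      #include <emmintrin.h>
      #include <stdio.h>

      int sum_int32(const int *p, int n)   /* n assumed to be a multiple of 4 */
      {
          __m128i acc = _mm_setzero_si128();
          for (int i = 0; i < n; i += 4)
              acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(p + i)));
          int lanes[4];
          _mm_storeu_si128((__m128i *)lanes, acc);
          return lanes[0] + lanes[1] + lanes[2] + lanes[3];
      }

      int main(void)
      {
          int data[9] = {0, 1, 2, 3, 4, 5, 6, 7, 8};
          printf("%d\n", sum_int32(data + 1, 8));   /* deliberately misaligned start */
          return 0;
      }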
  • 64. Faster Synchronization Primitives
    • Multi-threaded software becoming more prevalent
    • Scalability of multi-thread applications can be limited by synchronization
    • Synchronization primitives: LOCK prefix, XCHG
    • Reduce synchronization latency for legacy software
    Greater thread scalability with Nehalem. [Chart: synchronization-primitive latency across the Intel® Pentium® 4 processor, Intel® Core™2 Duo processor, and Intel® Core™ microarchitecture (Nehalem)-based processor.1]
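    For reference, this is the kind of LOCK-prefixed operation those latencies govern, written as a hedged C11 sketch: an atomic_flag spinlock (a LOCK XCHG/BTS-style exchange underneath) and atomic_fetch_add (a LOCK XADD-style increment); thread and iteration counts are arbitrary.

      #include <stdatomic.h>
      #include <pthread.h>
      #include <stdio.h>

      static atomic_flag lock = ATOMIC_FLAG_INIT;
      static long shared = 0;                       /* protected by the spinlock */
      static atomic_long lockfree = 0;              /* updated lock-free         */

      static void *worker(void *arg)
      {
          (void)arg;
          for (int i = 0; i < 100000; i++) {
              while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
                  ;                                 /* spin until acquired       */
              shared++;
              atomic_flag_clear_explicit(&lock, memory_order_release);

              atomic_fetch_add(&lockfree, 1);       /* single atomic increment   */
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t t[4];
          for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
          for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
          printf("shared=%ld lock-free=%ld\n", shared, atomic_load(&lockfree));
          return 0;
      }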
  • 65. Intel® Core™ Microarchitecture (Nehalem) SMT Implementation Details. SMT is efficient due to minimal replication of logic.
    • Replicated (duplicate logic per thread): register state, renamed RSB, large-page ITLB
    • Partitioned (statically allocated between threads): load buffer, store buffer, reorder buffer, small-page ITLB
    • Competitively shared (depends on each thread's dynamic behavior): reservation station, caches, data TLB, 2nd-level TLB
    • SMT unaware (no SMT impact): execution units
  • 66. Feeding the Execution Engine
    • Powerful 4-wide dynamic execution engine
    • Need to keep providing fuel to the execution engine
    • Intel ® Core™ Microarchitecture (Nehalem) Goals
      • Low latency to retrieve data
        • Keep execution engine fed w/o stalling
      • High data bandwidth
        • Handle requests from multiple cores/threads seamlessly
      • Scalability
        • Design for increasing core counts
    • Combination of great cache hierarchy and new platform
    Intel ® Core™ microarchitecture (Nehalem) designed to feed the execution engine
  • 67. Inclusive vs. Exclusive Caches – Cache Miss. [Side-by-side diagrams: Cores 0–3 above an L3 cache, exclusive vs. inclusive.] A data request from Core 0 misses Core 0's L1 and L2, so the request is sent to the L3 cache.
  • 68. Inclusive vs. Exclusive Caches – Cache Miss. Core 0 looks up the L3 cache; the data is not in the L3 cache. MISS in both the exclusive and the inclusive case.
  • 69. Inclusive vs. Exclusive Caches – Cache Miss. Exclusive: must check the other cores. Inclusive: guaranteed the data is not on-die. Greater scalability from the inclusive approach.
  • 70. Inclusive vs. Exclusive Caches – Cache Hit. HIT in both cases. Exclusive: no need to check other cores. Inclusive: data could be in another core, BUT Intel® Core™ microarchitecture (Nehalem) is smart…
  • 71. Inclusive vs. Exclusive Caches – Cache Hit. Inclusive: HIT. Core valid bits limit unnecessary snoops.
    • Maintain a set of “core valid” bits per cache line in the L3 cache
    • Each bit represents a core
    • If the L1/L2 of a core may contain the cache line, then core valid bit is set to “1”
    • No snoops of cores are needed if no bits are set
    • If more than 1 bit is set, line cannot be in Modified state in any core
    [Diagram: an L3 cache line with core valid bits 0 0 0 0.]
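    A tiny hedged sketch of the core-valid-bit filtering idea from this slide; the structure layout and core count are illustrative, not the real design.

      /* One valid bit per core per L3 line; a snoop of core i is needed only
       * on an L3 hit whose valid bit i is set. */
      #include <stdbool.h>
      #include <stdint.h>

      #define NUM_CORES 4

      struct l3_line {
          bool    present;      /* line is in the L3 (inclusive cache)       */
          uint8_t core_valid;   /* bit i set => core i may hold the line     */
      };

      /* Returns a bitmask of cores that must be snooped for this request. */
      static uint8_t cores_to_snoop(const struct l3_line *line, int requesting_core)
      {
          if (!line->present)
              return 0;         /* inclusive L3 miss: data is not on-die     */
          /* Snoop only cores whose valid bit is set, excluding the requester. */
          return line->core_valid & (uint8_t)~(1u << requesting_core);
      }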
  • 72. Inclusive vs. Exclusive Caches – Read from Other Core. Exclusive: MISS; must check all other cores. Inclusive: HIT; only need to check the core whose core valid bit is set. [Diagram: core valid bits 0 0 1 0.]
  • 73. Local Memory Access
    • CPU0 requests cache line X, not present in any CPU0 cache
      • CPU0 requests data from its DRAM
      • CPU0 snoops CPU1 to check if data is present
    • Step 2:
      • DRAM returns data
      • CPU1 returns snoop response
    • Local memory latency is the maximum latency of the two responses
    • Intel® Core™ microarchitecture (Nehalem) optimized to keep key latencies close to each other
    [Diagram: CPU0 and CPU1, each with local DRAM, connected by Intel® QPI. Intel® QPI = Intel® QuickPath Interconnect.]
  • 74. Remote Memory Access
    • CPU0 requests cache line X, not present in any CPU0 cache
      • CPU0 requests data from CPU1
      • Request sent over Intel® QuickPath Interconnect (Intel® QPI) to CPU1
      • CPU1’s IMC makes request to its DRAM
      • CPU1 snoops internal caches
      • Data returned to CPU0 over Intel QPI
    • Remote memory latency a function of having a low latency interconnect
    [Diagram: CPU0 and CPU1, each with local DRAM, connected by Intel® QPI.]
  • 75. Hardware Prefetching (HWP)
    • HW Prefetching critical to hiding memory latency
    • Structure of the HWPs is similar to that in Intel® Core™2 microarchitecture
      • Algorithmic improvements in Intel ® Core™ microarchitecture (Nehalem) for higher performance
    • L1 Prefetchers
      • Based on instruction history and/or load address pattern
    • L2 Prefetchers
      • Prefetches loads/RFOs/code fetches based on address pattern
      • Intel Core microarchitecture (Nehalem) changes:
        • Efficient Prefetch mechanism
          • Remove the need for Intel® Xeon® processors to disable HWP
        • Increase prefetcher aggressiveness
          • Locks on to address streams quicker, adapts to changes faster, and issues more prefetches more aggressively (when appropriate)
  • 76. Today’s Platform Architecture
  • 77. Intel® QuickPath Interconnect
    • Intel ® Core™ microarchitecture (Nehalem) introduces new Intel ® QuickPath Interconnect (Intel ® QPI)
    • High bandwidth , low latency point to point interconnect
    • Up to 6.4 GT/sec initially
      • 6.4 GT/sec -> 12.8 GB/sec
      • Bi-directional link -> 25.6 GB/sec per link
      • Future implementations at even higher speeds
    • Highly scalable for systems with varying # of sockets
    [Diagram: two-socket Intel® Core™ microarchitecture (Nehalem-EP) platform; the CPUs connect to each other and to the IOH over Intel® QPI, and each CPU has its own memory.]
  • 78. Integrated Memory Controller (IMC)
    • Memory controller optimized per market segment
    • Initial Intel ® Core™ microarchitecture (Nehalem) products
      • Native DDR3 IMC
      • Up to 3 channels per socket
      • Massive memory bandwidth
      • Designed for low latency
      • Support RDIMM and UDIMM
      • RAS Features
    • Future products
      • Scalability
        • Vary # of memory channels
        • Increase memory speeds
        • Buffered and Non-Buffered solutions
      • Market specific needs
        • Higher memory capacity
        • Integrated graphics
    Significant performance through the new IMC. [Diagram: two Intel® Core™ microarchitecture (Nehalem-EP) sockets, each with DDR3 memory, connected to the Intel® Next Generation Server Processor Technology (Tylersburg-EP) IOH.]
  • 79. Memory Latency Comparison
    • Low memory latency critical to high performance
    • Design integrated memory controller for low latency
    • Need to optimize both local and remote memory latency
    • Intel ® Core™ microarchitecture (Nehalem) delivers
      • Huge reduction in local memory latency
      • Even remote memory latency is fast
    • Effective memory latency varies per application/OS
      • Percentage of local vs. remote accesses
      • Intel Core microarchitecture (Nehalem) has lower latency regardless of mix
    [Chart: local and remote memory latency.1 1 Next generation Quad-Core Intel® Xeon® processor (Harpertown), Intel® Core™ microarchitecture (Nehalem).]
  • 80. Latency of Virtualization Transitions
    • Microarchitectural
      • Huge latency reduction generation over generation
      • Nehalem continues the trend
    • Architectural
      • Virtual Processor ID (VPID) added in Intel ® Core™ microarchitecture (Nehalem)
      • Removes need to flush TLBs on transitions
    Higher Virtualization Performance Through Lower Transition Latencies. [Chart: round trip virtualization latency.1 1 Intel® Core™ microarchitecture (formerly Merom), 45nm next generation Intel® Core™ microarchitecture (Penryn), Intel® Core™ microarchitecture (Nehalem).]
  • 81. Extended Page Tables (EPT) Motivation. [Diagram: VM 1 with its guest page table and the VMM-maintained active page table.]
    • A VMM needs to protect physical memory
      • Multiple Guest OSs share the same physical memory
      • Protections are implemented through page-table virtualization
    • Page table virtualization accounts for a significant portion of virtualization overheads
      • VM Exits / Entries
    • The goal of EPT is to reduce these overheads
    Guest page table changes cause exits into the VMM. The VMM maintains the active page table (referenced by CR3), which is used by the CPU.
  • 82. STTNI – String & Text New Instructions. Operates on strings of bytes or words (16b).
    • Equal Each instruction: true for each character in Src2 if the same position in Src1 is equal. Src1: Test day / Src2: tad tseT / Mask: 01101111
    • Equal Ordered instruction: finds the start of a substring (Src1) within another string (Src2). Src1: ABCA0XYZ / Src2: S0BACBAB / Mask: 00000010
    • Equal Any instruction: true for each character in Src2 if any character in Src1 matches. Src1: Example / Src2: atad tsT / Mask: 10100000
    • Ranges instruction: true if a character in Src2 is in at least one of up to 8 ranges in Src1. Src1: AZ'0'9zzz / Src2: taD tseT / Mask: 00100001
    Projected 3.8x kernel speedup on XML parsing & 2.7x savings on instruction cycles. [Diagram: the STTNI model compares Source1 (XMM) against Source2 (XMM / M128) element by element to build a comparison matrix, which is aggregated into the intermediate result IntRes1.]
  • 83. STTNI Model. [Diagrams: Source1 (XMM) x Source2 (XMM / M128) comparison matrices for each aggregation mode, reduced to the intermediate result IntRes1.]
    • EQUAL ANY: OR the results down each column
    • EQUAL EACH: check each bit on the diagonal
    • EQUAL ORDERED: AND the results along each diagonal
    • RANGES: the first compare does GE, the next does LE; AND the GE/LE pairs of results, then OR those results
  • 84. ATA – Application Targeted Accelerators: CRC32 and POPCNT
    • CRC32 accumulates a CRC32 value using the iSCSI polynomial. One register maintains the running CRC value as a software loop iterates over the data. Fixed CRC polynomial = 11EDC6F41h. Replaces complex instruction sequences for CRC in upper-layer data protocols: iSCSI, RDMA, SCTP. Enables enterprise-class data assurance with high data rates in networked storage in any user environment. [Diagram: 8/16/32/64-bit source data combined with the old CRC in the low 32 bits of DST to produce the new CRC.]
    • POPCNT determines the number of nonzero bits in the source. Useful for speeding up fast matching in data mining workloads, including DNA/genome matching and voice recognition. The ZF flag is set if the result is zero; all other flags (C, S, O, A, P) are reset. [Diagram: POPCNT of a 64-bit source (RBX, e.g. 0x3) written to the destination (RAX), with ZF indicating a zero result.]
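    A short sketch of the two accelerators through their intrinsics from <nmmintrin.h> (_mm_popcnt_u32 and _mm_crc32_u8): a POPCNT-based Hamming distance and a byte-at-a-time CRC-32C accumulation; the sample inputs are arbitrary.

      /* Build with SSE4.2 enabled (e.g. -msse4.2 on GCC). */
      #include <nmmintrin.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      /* Hamming distance between two 32-bit patterns via POPCNT. */
      static int hamming32(uint32_t a, uint32_t b)
      {
          return _mm_popcnt_u32(a ^ b);
      }

      /* CRC-32C (iSCSI polynomial) over a byte buffer, one byte per instruction. */
      static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
      {
          const uint8_t *p = buf;
          while (len--)
              crc = _mm_crc32_u8(crc, *p++);
          return crc;
      }

      int main(void)
      {
          printf("hamming(0xF0F0, 0x0FF0) = %d\n", hamming32(0xF0F0, 0x0FF0));
          const char msg[] = "iSCSI payload";
          printf("crc32c = 0x%08x\n", crc32c(0xFFFFFFFFu, msg, strlen(msg)));
          return 0;
      }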
  • 85. Tools Support of New Instructions
    • Intel® Compiler 10.x supports the new instructions
      • Nehalem specific compiler optimizations
      • SSE4.2 supported via vectorization and intrinsics
      • Inline assembly supported on both IA-32 and Intel® 64 architecture targets
      • Necessary to include required header files in order to access intrinsics
    • Intel® XML Software Suite
      • High performance C++ and Java runtime libraries
      • Version 1.0 (C++), version 1.01 (Java) available now
      • Version 1.1 w/SSE4.2 optimizations planned for September 2008
    • Microsoft Visual Studio* 2008 VC++
      • SSE4.2 supported via intrinsics
      • Inline assembly supported on IA-32 only
      • Necessary to include required header files in order to access intrinsics
      • VC++ 2008 tools masm, msdis, and debuggers recognize the new instructions
    • Sun Studio Express* 7/08
      • Supports Intel ® Core TM microarchitecture (Merom), 45nm next generation Intel ® Core™ microarchitecture (Penryn), Intel ® Core TM microarchitecture (Nehalem)
      • SSE4.1, SSE4.2 through intrinsics
      • Nehalem specific compiler optimizations
    • GCC* 4.3.1
      • Supports Intel Core microarchitecture (Merom), 45nm next generation Intel Core microarchitecture (Penryn), and Intel Core microarchitecture (Nehalem) via -mtune=generic
      • Supports SSE4.1 and SSE4.2 through the vectorizer and intrinsics
    Broad Software Support for Intel® Core™ Microarchitecture (Nehalem)
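    One practical pattern implied by the tools list above is guarding intrinsic use so the same source still builds for older targets; in this hedged sketch, __SSE4_2__ is the predefined macro GCC and the Intel compiler set when SSE4.2 code generation is enabled (e.g. -msse4.2), which should be verified for your toolchain.

      #include <stdint.h>

      #if defined(__SSE4_2__)
      #include <nmmintrin.h>
      static inline int popcount32(uint32_t x) { return _mm_popcnt_u32(x); }
      #else
      /* Portable fallback when the new instructions are not available. */
      static inline int popcount32(uint32_t x)
      {
          int n = 0;
          while (x) { x &= x - 1; n++; }
          return n;
      }
      #endif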
  • 86. Software Optimization Guidelines
    • Most optimizations for Intel ® Core™ microarchitecture still hold
    • Examples of new optimization guidelines:
      • 16-byte unaligned loads/stores
      • Enhanced macrofusion rules
      • NUMA optimizations
    • Intel ® Core™ microarchitecture (Nehalem) SW Optimization Guide will be published
    • Intel ® Compiler will support settings for Intel Core microarchitecture (Nehalem) optimizations
  • 87. Example Code For strlen()
    STTNI Version:
      int sttni_strlen(const char * src)
      {
          char eom_vals[32] = {1, 255, 0};
          __asm{
              mov     eax, src
              movdqu  xmm2, eom_vals
              xor     ecx, ecx
          topofloop:
              add     eax, ecx
              movdqu  xmm1, OWORD PTR[eax]
              pcmpistri xmm2, xmm1, imm8
              jnz     topofloop
          endofstring:
              add     eax, ecx
              sub     eax, src
              ret
          }
      }
    Current Code:
      string  equ     [esp + 4]
              mov     ecx,string              ; ecx -> string
              test    ecx,3                   ; test if string is aligned on 32 bits
              je      short main_loop
      str_misaligned:                         ; simple byte loop until string is aligned
              mov     al,byte ptr [ecx]
              add     ecx,1
              test    al,al
              je      short byte_3
              test    ecx,3
              jne     short str_misaligned
              add     eax,dword ptr 0         ; 5 byte nop to align label below
              align   16                      ; should be redundant
      main_loop:
              mov     eax,dword ptr [ecx]     ; read 4 bytes
              mov     edx,7efefeffh
              add     edx,eax
              xor     eax,-1
              xor     eax,edx
              add     ecx,4
              test    eax,81010100h
              je      short main_loop
              ; found zero byte in the loop
              mov     eax,[ecx - 4]
              test    al,al                   ; is it byte 0
              je      short byte_0
              test    ah,ah                   ; is it byte 1
              je      short byte_1
              test    eax,00ff0000h           ; is it byte 2
              je      short byte_2
              test    eax,0ff000000h          ; is it byte 3
              je      short byte_3
              jmp     short main_loop         ; taken if bits 24-30 are clear and bit 31 is set
      byte_3:
              lea     eax,[ecx - 1]
              mov     ecx,string
              sub     eax,ecx
              ret
      byte_2:
              lea     eax,[ecx - 2]
              mov     ecx,string
              sub     eax,ecx
              ret
      byte_1:
              lea     eax,[ecx - 3]
              mov     ecx,string
              sub     eax,ecx
              ret
      byte_0:
              lea     eax,[ecx - 4]
              mov     ecx,string
              sub     eax,ecx
              ret
      strlen  endp
              end
    Current Code: minimum of 11 instructions; the inner loop processes 4 bytes with 8 instructions. STTNI Code: minimum of 10 instructions; a single inner loop processes 16 bytes with only 4 instructions.
  • 88. CRC32 Preliminary Performance
      uint32 crc32c_sse42_optimized_version(uint32 crc, unsigned char const *p, size_t len)
      {
          // Assuming len is a multiple of 0x10
          asm("pusha");
          asm("mov %0, %%eax" :: "m" (crc));
          asm("mov %0, %%ebx" :: "m" (p));
          asm("mov %0, %%ecx" :: "m" (len));
          asm("1:");
          // Processing four bytes at a time, unrolled four times
          // (AT&T operand order: memory source first, CRC accumulator register last)
          asm("crc32l 0x0(%ebx), %eax");
          asm("crc32l 0x4(%ebx), %eax");
          asm("crc32l 0x8(%ebx), %eax");
          asm("crc32l 0xc(%ebx), %eax");
          asm("add $0x10, %ebx");
          asm("sub $0x10, %ecx");
          asm("jecxz 2f");
          asm("jmp 1b");
          asm("2:");
          asm("mov %%eax, %0" : "=m" (crc));
          asm("popa");
          return crc;
      }
    • Preliminary tests involved kernel code implementing CRC algorithms commonly used by iSCSI drivers
    • 32-bit and 64-bit versions of the kernel under test
    • 32-bit version processes 4 bytes of data using 1 CRC32 instruction
    • 64-bit version processes 8 bytes of data using 1 CRC32 instruction
    • Input strings of sizes 48 bytes and 4 KB used for the test
    CRC32-optimized code, preliminary results (speedup of the CRC32 instruction over the fastest CRC32C software algorithm):
      Input data size = 48 bytes:   6.53x (32-bit),   9.85x (64-bit)
      Input data size = 4 KB:       9.3x  (32-bit),  18.63x (64-bit)
    Preliminary results show the CRC32 instruction out-performing the fastest CRC32C software algorithm by a big margin
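    A C-intrinsics equivalent of the 32-bit loop above, as a sketch (hypothetical helper name, assuming <nmmintrin.h> and len a multiple of 4; a 64-bit build could use _mm_crc32_u64 to process 8 bytes per step):
      #include <nmmintrin.h>   /* SSE4.2: _mm_crc32_u32 */
      #include <stdint.h>
      #include <stddef.h>
      #include <string.h>

      static uint32_t crc32c_sse42_intrin(uint32_t crc, const unsigned char *p, size_t len)
      {
          /* One CRC32 instruction per 4 bytes, as in the 32-bit kernel test. */
          for (size_t i = 0; i < len; i += 4) {
              uint32_t word;
              memcpy(&word, p + i, sizeof(word));  /* avoid an unaligned dereference */
              crc = _mm_crc32_u32(crc, word);
          }
          return crc;
      }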
  • 89. Idle Power Matters
    • Data center operating costs 1
      • 41M physical servers by 2010, average utilization < 10%
      • $0.50 spent on power and cooling for every $1 spent on server hardware
    • Regulatory requirements affect all segments
      • ENERGY STAR* and related requirements
    • Environmental responsibility
    Idle power consumption is not just a mobile concern
    • IDC’s Datacenter Trends Survey, January 2007
  • 90. CPU Core Power Consumption
    • High frequency processes are leaky
      • Reduced via high-K metal gate process, design technologies, manufacturing optimizations
    [Core power chart: Leakage]
  • 91. CPU Core Power Consumption
    • High frequency designs require high performance global clock distribution
    • High frequency processes are leaky
      • Reduced via high-K metal gate process, design technologies, manufacturing optimizations
    [Core power chart: Leakage, Clock Distribution]
  • 92. CPU Core Power Consumption
    • Remaining power in logic, local clocks
      • Power efficient microarchitecture, good clock gating minimize waste
    • High frequency designs require high performance global clock distribution
    • High frequency processes are leaky
      • Reduced via high-K metal gate process, design technologies, manufacturing optimizations
    [Core power chart: Leakage, Clock Distribution, Local Clocks and Logic = Total Core Power Consumption]
    Challenge: minimize power when idle
  • 93. C-State Support Before Intel ® Core™ Microarchitecture (Nehalem)
    • C0: CPU active state
    [Active core power chart: Leakage, Clock Distribution, Local Clocks and Logic]
  • 94. C-State Support Before Intel ® Core™ Microarchitecture (Nehalem)
    • C0: CPU active state
    • C1, C2 states (early 1990s):
      • Stop core pipeline
      • Stop most core clocks
    [Active core power chart: Leakage, Clock Distribution, Local Clocks and Logic]
  • 95. C-State Support Before Intel ® Core™ Microarchitecture (Nehalem)
    • C0: CPU active state
    • C1, C2 states (early 1990s):
      • Stop core pipeline
      • Stop most core clocks
    • C3 state (mid 1990s):
      • Stop remaining core clocks
    [Active core power chart: Leakage, Clock Distribution]
  • 96. C-State Support Before Intel ® Core™ Microarchitecture (Nehalem)
    • C0: CPU active state
    • C1, C2 states (early 1990s):
      • Stop core pipeline
      • Stop most core clocks
    • C3 state (mid 1990s):
      • Stop remaining core clocks
    • C4, C5, C6 states (mid 2000s):
      • Drop core voltage, reducing leakage
      • Voltage reduction via shared VR
    [Active core power chart: Leakage]
    Existing C-states significantly reduce idle power
  • 97. C-State Support Before Intel ® Core™ Microarchitecture (Nehalem)
    • Cores share a single voltage plane
      • All cores must be idle before the voltage can be reduced
      • Independent VRs per core are prohibitive from a cost and form factor perspective
    • Deepest C-states have relatively long exit latencies
      • System / VR handshake, ramp voltage, restore state, restart pipeline, etc.
    Deepest C-states available in mobile products
  • 98. Intel ® Core™ Microarchitecture (Nehalem) Core C-State Support
    • C0: CPU active state
    [Active core power chart: Leakage, Clock Distribution, Local Clocks and Logic]
  • 99. Intel ® Core™ Microarchitecture (Nehalem) Core C-State Support
    • C0: CPU active state
    • C1 state:
      • Stop core pipeline
      • Stop most core clocks
    [Active core power chart: Leakage, Clock Distribution, Local Clocks and Logic]
  • 100. Intel ® Core™ Microarchitecture (Nehalem) Core C-State Support
    • C0: CPU active state
    • C1 state:
      • Stop core pipeline
      • Stop most core clocks
    • C3 state:
      • Stop remaining core clocks
    [Active core power chart: Leakage, Clock Distribution]
  • 101. Intel ® Core™ Microarchitecture (Nehalem) Core C-State Support
    • C0: CPU active state
    • C1 state:
      • Stop core pipeline
      • Stop most core clocks
    • C3 state:
      • Stop remaining core clocks
    • C6 state:
      • Processor saves architectural state
      • Turn off power gate, eliminating leakage
    [Active core power chart: Leakage]
    Core idle power goes to ~0
  • 102. C6 Support on Intel ® Core™2 Duo Mobile Processor (Penryn)
  • 103. C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn) [Timeline diagram: Core 0 and Core 1 power vs. time] Both cores running applications.
  • 104. C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn) Task on Core 1 completes. No work waiting. OS executes the MWAIT(C6) instruction.
  • 105. C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn) Execution stops. Core architectural state is saved. Core clocks stopped. Core 0 continues execution undisturbed.
  • 106. C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn) Task on Core 0 completes. No work waiting. OS executes the MWAIT(C6) instruction. Core 0 enters C6.
  • 107. C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn) VR voltage reduced. Power drops.
  • 108. C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn) Interrupt for Core 1 arrives. VR voltage increased. Core 1 clocks turn on, core state is restored, and the core resumes execution at the instruction following MWAIT(C6). Core 0 remains idle.
  • 109. C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn) Interrupt for Core 0 arrives. Core 0 returns to C0 and resumes execution at the instruction following MWAIT(C6). Core 1 continues execution undisturbed. C6 significantly reduces idle power consumption.
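    As context for the MWAIT(C6) step above, a conceptual ring-0 sketch of how an OS idle loop requests a core C-state with MONITOR/MWAIT (GCC inline assembly; the 0x20 hint is the value Linux's intel_idle driver associates with Nehalem-class C6 and is used here only for illustration):
      /* Kernel/ring-0 only: user code cannot execute MONITOR/MWAIT. */
      static inline void idle_request_c6(const void *monitor_addr)
      {
          /* Arm the monitor on a cache line the wakeup path will write to. */
          __asm__ volatile("monitor"
                           :: "a"(monitor_addr), "c"(0UL), "d"(0UL));
          /* Request the target C-state via the hint in EAX; an interrupt (or a
           * write to the monitored line) resumes execution at the next instruction. */
          __asm__ volatile("mwait"
                           :: "a"(0x20UL), "c"(0UL));
      }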
  • 110. Reducing Platform Idle Power
    • Dramatic improvements in CPU idle power increase importance of platform improvements
    • Memory power:
      • Memory clocks stopped between requests at low utilization
      • Memory put into self-refresh in package C3, C6
    • Link power:
      • Intel® QuickPath Interconnect links to lower power states as CPU becomes less active
      • PCI Express* links on chipset have similar behavior
    • Hint to VR to reduce phases during periods of low current demand
    Intel ® Core™ microarchitecture (Nehalem) reduces CPU and platform power
  • 111. Intel ® Core™ Microarchitecture (Nehalem): Integrated Power Gate
    • Integrated power switch between VR output and core voltage supply
      • Very low on-resistance
      • Very high off-resistance
      • Much faster voltage ramp than external VR
    • Enables per core C6 state
      • Individual cores transition to ~0 power state
      • Transparent to other cores, platform, software, and VR
    Close collaboration with process technology to optimize device characteristics
    [Block diagram: VCC, per-core power gates, Core0/Core1/Core2/Core3; VTT, Memory System, Cache, I/O]
  • 112. Agenda
    • Intel ® Core™ microarchitecture (Nehalem) power management overview
    • Minimizing idle power consumption
    • Performance when you need it
  • 113. Turbo Mode Before Intel® Core™ Microarchitecture (Nehalem) [Frequency chart: Core 0 and Core 1, No Turbo] Workload lightly threaded. Clock stopped on the inactive core: power reduction in inactive cores.
  • 114. Turbo Mode Before Intel® Core™ Microarchitecture (Nehalem) [Frequency chart: Core 0 and Core 1, No Turbo vs. Turbo Mode] Turbo Mode responds to the workload by adding additional performance bins within the available headroom.
  • 115. Intel® Core™ Microarchitecture (Nehalem) Turbo Mode [Frequency chart: Core 0 through Core 3, No Turbo] Workload lightly threaded or < TDP. Power gating: zero power for inactive cores.
  • 116. Intel® Core™ Microarchitecture (Nehalem) Turbo Mode [Frequency chart: Core 0 through Core 3, No Turbo vs. Turbo Mode] Turbo Mode responds to the workload by adding additional performance bins within the available headroom. Power gating: zero power for inactive cores.
  • 117. Intel® Core™ Microarchitecture (Nehalem) Turbo Mode [Frequency chart: Core 0 through Core 3, No Turbo vs. Turbo Mode] Turbo Mode responds to the workload by adding additional performance bins within the available headroom. Power gating: zero power for inactive cores. Workload lightly threaded or < TDP.
  • 118. Intel® Core™ Microarchitecture (Nehalem) Turbo Mode [Frequency chart: all four cores active, No Turbo] Workload lightly threaded or < TDP. Active cores running workloads < TDP.
  • 119. Intel® Core™ Microarchitecture (Nehalem) Turbo Mode [Frequency chart: all four cores active, No Turbo vs. Turbo Mode] Active cores running workloads < TDP. Turbo Mode responds to the workload by adding additional performance bins within the available headroom.
  • 120. Intel® Core™ Microarchitecture (Nehalem) Turbo Mode [Frequency chart: No Turbo vs. Turbo Mode, Core 0 through Core 3] Dynamically delivering optimal performance and energy efficiency: Turbo Mode adds additional performance bins within the available headroom in response to the workload; power gating delivers zero power for inactive cores; workload lightly threaded or < TDP.
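    As a user-level sketch of checking whether Turbo Mode (Intel Turbo Boost Technology) is reported by the processor, using GCC's <cpuid.h>; CPUID leaf 06H, EAX bit 1 is the architectural feature flag (assumes an x86 target and a GCC-compatible compiler):
      #include <cpuid.h>
      #include <stdio.h>

      int main(void)
      {
          unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
          /* CPUID.06H (Thermal and Power Management leaf): EAX[1] set means
           * Intel Turbo Boost Technology is available. */
          if (__get_cpuid(0x6, &eax, &ebx, &ecx, &edx) && (eax & (1u << 1)))
              printf("Turbo Boost reported by CPUID\n");
          else
              printf("Turbo Boost not reported\n");
          return 0;
      }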
  • 121. Additional Sources of Information on This Topic:
    • Other Sessions / Chalk Talks / Labs:
      • TCHS001: Next Generation Intel® Core™ Microarchitecture (Nehalem) Family of Processors: Screaming Performance, Efficient Power (8/19, 3:00 – 3:50)
      • DPTS001: High End Desktop Platform Design Overview for the Next Generation Intel® Microarchitecture (Nehalem) Processor (8/20, 2:40 – 3:30)
      • NGMS001: Next Generation Intel® Microarchitecture (Nehalem) Family: Architectural Insights and Power Management (8/19, 4:00 – 5:50)
      • NGMC001: Chalk Talk: Next Generation Intel® Microarchitecture (Nehalem) Family (8/19, 5:50 – 6:30)
      • NGMS002: Tuning Your Software for the Next Generation Intel® Microarchitecture (Nehalem) Family (8/20, 11:10 – 12:00)
      • PWRS003: Power Managing the Virtual Data Center with Windows Server* 2008 / Hyper-V and Next Generation Processor-based Intel® Servers Featuring Intel® Dynamic Power Technology (8/19, 3:00 – 3:50)
      • PWRS005: Platform Power Management Options for Intel® Next Generation Server Processor Technology (Tylersburg-EP) (8/21, 1:40 – 2:30)
      • SVRS002: Overview of the Intel® QuickPath Interconnect (8/21, 11:10 – 12:00)
  • 122. Session Presentations - PDFs
    • The PDF for this Session presentation is available from our IDF Content Catalog at the end of the day at:
    • www.intel.com/idf
    • or
    • https://intel.wingateweb.com/US08/scheduler/public.jsp
  • 123. Please Fill Out the Session Evaluation Form. Place the form in the evaluation box at the back of the session room.
    • Thank you for your input; we use it to improve future Intel Developer Forum events