Key Nehalem Choices




               Glenn Hinton
                Intel Fellow
      Nehalem Lead Architect
               Feb 17, 2010
Outline

• Review NHM timelines and overall issues
• Converged core tensions
• Big debate – wider vectors vs SMT vs more cores
• How features were decided
• Power, Power, Power
• Summary
Tick-Tock Development Model:
Pipelined developments – 5+ year projects

• Merom1 – TOCK – new microarchitecture, 65nm
• Penryn – TICK – new process, 45nm
• Nehalem – TOCK – new microarchitecture, 45nm
• Westmere – TICK – new process, 32nm
• Sandy Bridge – TOCK – new microarchitecture, 32nm (forecast)

• Merom started in 2001, released 2006
• Nehalem started in 2003 (with research even earlier)
• Sandy Bridge started in 2005
• Clearly already working on the new Tock after Sandy Bridge
• Most of the Nehalem uArch was decided by mid-2004
   – But most detailed engineering work happened in 2005/06/07

1Intel® Core™ microarchitecture (formerly Merom)
45nm next generation Intel® Core™ microarchitecture (Penryn)
Intel® Core™ Microarchitecture (Nehalem)
Intel® Microarchitecture (Westmere)
Intel® Microarchitecture (Sandy Bridge)

All dates, product descriptions, availability and plans are forecasts and subject to change without notice.
Nehalem:
Lots of competing aspects

[Word-cloud diagram of competing requirements across segments:
• Consumer Desktop – Play & Enjoy, Media Processing, Learn, Graphics Processing, Health & Wellness
• Server – Serviceability, Reliability, Availability, TPCC Perf, RAS features, Threading, Manageability
• Mobile – Increase Battery Life, Reduce Form Factor, Cable-free computing, Seamless Wireless, Cell, Anywhere/anytime computing, Connectivity
• Common core – Multi-core, Power/Perf Efficiency, 4/3/2/1 Core Scalability, Power Technologies, Security, Collaboration, SMCA, *T's]
The Blind Men and the Elephant
     It was six men of Indostan
       To learning much inclined,
     Who went to see the Elephant
        (Though all of them were blind),
     That each by observation
        Might satisfy his mind
       …
            …

     And so these men of Indostan
       Disputed loud and long,
     Each in his own opinion
       Exceeding stiff and strong,
     Though each was partly in the right,
       And all were in the wrong!
       by John Godfrey Saxe




“Converged Core” tradeoffs
Common CPU Core for multiple uses
• Mobile (Laptops)
• Desktop
• Server/HPC

• Workloads?
• How to trade off?
“Converged Core” tradeoffs
• Mobile
    – 1/2/4 core options; scalable caches
    – Low TDP power CPU and GPU
    – Very low “average” power (great partial sleep state power)
    – Very low sleep state power
    – Low V-min to give best power efficiency when active
    – Moderate DRAM bandwidth at low power
    – Very dynamic power management
    – Low cost for volume
    – Great single threaded performance
       – Most apps single threaded

•   Desktop
•   Server
•   Workloads – Productivity, Media, ISPEC, FSPEC, 32 vs 64 bit
•   How to trade off?
“Converged Core” tradeoffs
• Mobile

• Desktop
  – 1/2/4 core options; scalable caches
  – Media processing, high end game performance
  – Moderate DRAM bandwidth
  – Low cost for volume
  – Great single threaded performance
     – Most apps single threaded

• Server

• Workloads – Productivity, Media, Games, ISPEC, FSPEC, 32 vs
  64 bit
• How to trade off?
“Converged Core” tradeoffs
• Mobile
• Desktop
• Server
  – More physical address bits (speed paths, area, power)
  – More RAS features (ECC on caches, TLBs, etc.)
  – Larger caches, TLBs, BTBs, multi-socket snoop, etc
  – Fast Locks and multi-threaded optimizations
  – More DRAM channels (BW and capacity) & more external links
  – Dynamic power management
  – Many cores (4, 8, etc) so need low power per core
  – SMT gives large perf gain since threaded apps
  – Low V-min to allow many cores to fit in low blade power envelopes

• Workloads – Workstation, Server, ISPEC, FSPEC, 64 bit
• How to trade off?
“Converged Core” tradeoffs
• Mobile
   –   1/2/4 core options; scalable caches
   –   Low TDP power CPU and GPU
   –   Very low “average” power (great partial sleep state power)
   –   Very low sleep state power
   –   Low V-min to give best power efficiency when active
   –   Moderate DRAM bandwidth at low power
   –   Very dynamic power management
   –   Low cost for volume
   –   Great single threaded performance (most apps single threaded)
• Desktop
   –   1/2/4 core options; scalable caches
   –   Media processing, high end game performance
   –   Moderate DRAM bandwidth
   –   Low cost for volume
   –   Great single threaded performance (most apps single threaded)
• Server
   –   More physical address bits (speed paths, area, power)
   –   More RAS features (ECC on caches, TLBs, etc.)
   –   Larger caches, TLBs, BTBs, multi-socket snoop, etc
   –   Fast Locks and multi-threaded optimizations
   –   More DRAM channels (BW and capacity) and more external links
   –   Dynamic power management
   –   Many cores (4, 8, etc) so need low power per core
   –   SMT gives large perf gain since threaded apps
   –   Low V-min to allow many cores to fit in low blade power envelopes
• Workloads – Productivity, Media, Workstation, Server, ISPEC, FSPEC, 32 vs 64 bit, etc
• How to trade off?
Early Base Core Selection - 2003
• Goals
     –   Best single threaded perf?
     –   Lowest cost dual core?
     –   Lowest power dual core?
     –   Best laptop battery life?
     –   Most cores that fit in server size?
     –   Best total throughput for cost/power in multi-core?
     –   Least engineering costs?
• Major options
  – Enhanced Northwood (P4) pipeline?
  – Enhanced P6 pipeline (like Merom – Core 2 Duo)?
  – New from scratch pipeline?
• Why went with enhanced P6 (Merom) pipeline?
  – Lower power per core, lower per core die size, lower total effort
  – Better SW optimization consistency
• Likely gave up some ST perf (10-20%?)
  – But unlikely to have been able to do 'bigger' alternatives
2004 Major Decision:
Cores vs Vectors vs SMT
 • Just use older Penryn cores and have 3 or 4 of them?
     – No single threaded performance gains
 • Put in wider vectors (like recently announced AVX)?
     – 256bit wide vectors (called VSSE back then)
     – Very power and area efficient, if doing wide vectors
     – Consumes die size and power when not using
 • Add SMT per core and have fewer cores?
     – Very power and area efficient
     – Adds a lot of complexity; some die cost and power when not using
 •   What do servers want? Lots of cores/threads
 •   Laptops? Low power cores
 •   HE Desktops? Great media/game performance
 •   Options with similar die cost:
     – 2 enhanced cores + SMT + Wide Vectors?
     – 3 enhanced cores + SMT?
     – 4 simplified cores?


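The option space above can be sketched as a toy throughput model. Every scaling factor below (SMT gain, vector gain, the single-thread perf of a simplified core, the multi-core scaling efficiency) is an invented assumption for illustration, not Nehalem data.

```python
# Toy model of the 2004 options: aggregate throughput for a fixed die
# budget.  All scaling factors are invented assumptions, not Intel data.

def throughput(cores, st_perf=1.0, smt_gain=0.0, vector_gain=0.0,
               scaling=0.9):
    """Aggregate multi-threaded throughput of a hypothetical part.

    st_perf     -- per-core single-thread perf (enhanced core = 1.0)
    smt_gain    -- assumed extra throughput per core from SMT
    vector_gain -- assumed extra throughput from wide (VSSE-style) vectors
    scaling     -- per-core efficiency when all cores are busy
    """
    per_core = st_perf * (1.0 + smt_gain) * (1.0 + vector_gain)
    return cores * per_core * scaling

# The three similar-die-cost options from this slide, with assumed factors:
opt_2c = throughput(2, smt_gain=0.2, vector_gain=0.3)  # 2 enhanced + SMT + vectors
opt_3c = throughput(3, smt_gain=0.2)                   # 3 enhanced + SMT
opt_4c = throughput(4, st_perf=0.85)                   # 4 simplified cores
```

With these made-up factors the 3-core + SMT option edges out the others on threaded code, but the slides that follow show the real decision also weighed power, single-thread perf, and per-segment value.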
2 Cores vs 4 Cores pros/cons
• 2 Cores + VSSE + SMT
   – Somewhat smaller die size
   – Lower power than 4 core
   – Better ST perf
   – Best media if use VSSE?
      – Not clear – looks like a wash
   – Specialized MIPS sometimes unused
   – VSSE gives perf for apps not easily threaded
      – Is threading really mainly for wizards?
   – New visible ISA feature like MMX, SSE
• 4 Cores
   – Better generic 4T perf
      – TPCC, multi-tasking
   – Best media perf on legacy 4T-enabled apps
   – Simpler HW design
   – Die size somewhat bigger
   – Simpler/harder SW enabling
      – Simpler since no VSSE
      – No SMT asymmetries
      – Harder since general 4T
   – 4T perf is also specialized
      – But probably less so than VSSE
   – TDP power somewhat higher
   – Somewhat lower average power
      – Since smaller single core
   – More granular to hit finer segments (1/2/3/4 core options)
   – More complex power management
   – Trade uArch change resources for power reduction
Nehalem Direction

• Tech Readiness Direction
   – 4/3/2/1 cores supported
   – VSSE dropped
      – Saves power and die size
   – SMT maintained
• VSSE in core less valued by servers and mobile
   – Casualty of Converged Core
   – Utilize scaled 4 core solution to recover media perf
• SMT to increase threaded perf
   – Initially target servers
• Spend more effort to reduce power

[Charts: core power vs. perf of the "2 Cores + VSSE + SMT", "3 Cores + SMT", "3 Cores", and "4 Cores" options on Specrate (top) and DH/Media (bottom) workloads]
Early goal: Remove Multi-Core Perf tax
(In power constrained usages)

• Early Multi-cores lower freq than single core variants
• When started Nehalem dual cores still 2-3 years away…
   – Wanted 4 cores in high-end volume systems – A big increase
• Lower power envelopes planned for all Nehalem usages
   – Thinner laptops, blade servers, small-form-factor desktops, etc
• Many apps still single threaded
• All cores running together can draw a lot of power – limits highest TDP freq
   – If just one core is active, the highest frequency can be a lot higher
• Turbo Mode/Power Gates removed this multi-core tax
   – One of biggest ST perf gains for mobile/power constrained usages
• Considered ways to do Turbo Mode
   – Decided must have flexible means to tune late in project
   – Added PCU with micro-controller to dynamically adapt




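The idea on this slide can be sketched as a minimal model: all cores share one TDP budget, so each power-gated core frees headroom that the remaining cores can spend as extra frequency bins. The 2.66 GHz base and 133 MHz bin size below are illustrative assumptions, not Nehalem's actual bin tables.

```python
# Minimal sketch of removing the multi-core frequency tax: gated cores
# free TDP headroom that active cores convert into frequency bins.
# base_ghz and bin_ghz values are illustrative assumptions.

def turbo_frequency(base_ghz, active_cores, total_cores, bin_ghz=0.133):
    """Frequency an active core may run at under a shared TDP budget."""
    if not 1 <= active_cores <= total_cores:
        raise ValueError("active_cores out of range")
    gated = total_cores - active_cores
    # each gated core is modeled as funding one extra bin for the rest;
    # the real PCU instead measures power and temperature directly
    return base_ghz + gated * bin_ghz

# Four-core part, 2.66 GHz base: a lone active core turbos the highest.
single = turbo_frequency(2.66, active_cores=1, total_cores=4)
all_up = turbo_frequency(2.66, active_cores=4, total_cores=4)
```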
Power Control Unit

• Integrated proprietary microcontroller
• Shifts control from hardware to embedded firmware
• Real-time sensors for temperature, current, power
• Flexibility enables sophisticated algorithms, tuned for current operating conditions

[Diagram: per-core Vcc and frequency sensors with PLLs, fed by Vcc and BCLK, reporting to the PCU; the PCU also manages the uncore and LLC]
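A hedged sketch of the control loop this slide describes: firmware polls the per-core sensors and nudges frequency to stay inside limits. The thresholds, step size, and frequency range below are invented for illustration.

```python
# Toy version of the PCU firmware loop: read sensors, step frequency.
# All limits and step sizes below are invented for illustration.

def pcu_step(freq_ghz, temp_c, current_a,
             temp_limit=95.0, current_limit=100.0,
             f_min=1.2, f_max=3.2, step=0.133):
    """One control iteration: return the next core frequency."""
    if temp_c > temp_limit or current_a > current_limit:
        return max(f_min, freq_ghz - step)      # over a limit: back off
    if temp_c < temp_limit - 10 and current_a < current_limit - 10:
        return min(f_max, freq_ghz + step)      # clear headroom: speed up
    return freq_ghz                             # near a limit: hold
```

Because a policy like this lives in firmware rather than fixed hardware, it can be retuned late in the project, which is exactly the flexibility the earlier slide says was required.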
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode

• Power Gating: zero power for inactive cores (C6 state)

[Diagram: frequency bars for Cores 0–3 with no Turbo; workload lightly threaded or < TDP]
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode

• Power Gating: zero power for inactive cores
• Turbo Mode: in response to workload, adds additional performance bins within headroom

[Diagram: with Cores 2 and 3 power-gated, Cores 0 and 1 run extra frequency bins; workload lightly threaded or < TDP]
uArch features – Converged Core

• Difficult balancing the needs of the 3 conflicting requirements
• All CPU core features must be very power efficient
   – Helps all segments, especially laptops and multi-core servers
   – Requirement was to beat a 2:1 power/perf ratio
   – Ended up more like 1.3:1 power/perf ratio for the perf features added
• Segment-specific features can't add much power or die size
• Initial Uncore optimized for 4-core DP server but had to scale down well to 2-core volume parts
• Many things mobile wanted also helped servers
   – Low power cores, lower die size per core, etc.
   – Active power management
   – More synergy than originally thought
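The 2:1 bar above amounts to a simple gate on each candidate feature. A sketch of that criterion; the percentages in the example are hypothetical.

```python
# Sketch of the converged-core power-efficiency bar: accept a core
# feature only if the % power it adds per % performance it gains
# beats a 2:1 ratio.  Example numbers are hypothetical.

def accept_feature(perf_gain_pct, power_cost_pct, max_ratio=2.0):
    """True if the feature's power/perf ratio is within the budget."""
    if perf_gain_pct <= 0:
        return False
    return power_cost_pct / perf_gain_pct <= max_ratio

# A feature costing 13% power for 10% perf (a 1.3:1 ratio) passes;
# one costing 25% power for 10% perf does not.
```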
Nehalem Power Efficiency Features

• Only adding power efficient uArch features
   – Net power : performance ratio of Nehalem core ~1.3 : 1
      – Far better than voltage scaling
• Reducing min operating voltage with linear freq decline
   – Cubic power reduction with ~linear perf reduction
• Implementing C6/power-gated low-power state
   – Provides significant reduction in average mobile power
• Turbo mode
   – Allows processor to utilize entire available power envelope
   – Reduces performance penalty from multi-core on ST apps

[Chart: approximate relative core power vs. performance curves for MeromC, Penryn, and NHM, normalized to MeromC @ 1.1V]
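The "cubic power reduction with ~linear perf reduction" bullet follows from the classic dynamic-power relation P ≈ C·V²·f: if a low V-min lets voltage fall roughly linearly with frequency, power falls with the cube of frequency while performance falls only about linearly. A minimal numeric check:

```python
# Dynamic switching power P = C * V^2 * f (relative units).  If V-min
# allows voltage to drop ~linearly with frequency, halving both cuts
# power 8x for only ~2x less performance: ~4x better perf/watt.

def dynamic_power(v, f, c=1.0):
    """Relative dynamic power for supply voltage v and frequency f."""
    return c * v * v * f

full = dynamic_power(v=1.0, f=1.0)   # full voltage and frequency
half = dynamic_power(v=0.5, f=0.5)   # both halved -> 1/8 the power
```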
Designed for Performance

New and improved features around the core pipeline (instruction fetch & L1 cache, decode & microcode, out-of-order scheduling & retirement, execution units, memory ordering & execution, paging, L1 data cache, L2 cache & interrupt servicing):

• New SSE4.2 instructions
• Improved lock support
• Additional caching hierarchy
• Deeper buffers
• Improved loop streaming
• Better branch prediction
• Simultaneous multi-threading
• Faster virtualization
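Of the SSE4.2 additions listed above, the CRC32 instruction computes the Castagnoli CRC (polynomial 0x1EDC6F41, bit-reflected 0x82F63B78), the checksum used by iSCSI among others. A bitwise software model of the same checksum, for illustration:

```python
# Software model of the CRC the SSE4.2 CRC32 instruction computes:
# CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.

def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # shift right, xor in the polynomial when the low bit is set
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```

The hardware instruction consumes 1, 2, 4, or 8 bytes per step rather than one bit at a time, but produces the same values as this bitwise reference.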
The First Intel® Core™ Microarchitecture
(Nehalem) Processor

[Die diagram: four cores and a central queue, with an integrated memory controller above and a shared L3 cache below, flanked by Misc IO and QPI 0/QPI 1 interfaces]

QPI: Intel® QuickPath Interconnect (Intel® QPI)

A Modular Design for Flexibility
Scalable Cores

Same core for all segments; common software optimization; common feature set

Intel® Core™ Microarchitecture (Nehalem), 45nm

• Servers/Workstations – Energy Efficiency, Performance, Virtualization, Reliability, Capacity, Scalability
• Desktop – Performance, Graphics, Energy Efficiency, Idle Power, Security
• Mobile – Battery Life, Performance, Energy Efficiency, Graphics, Security

Optimized cores to meet multiple market segments
Modularity

[Die diagrams: 45nm Nehalem Core i7 (four cores, queue, shared L3 cache, memory controller, Misc IO, QPI 0); 45nm Lynnfield/Clarksfield; 32nm Clarkdale/Arrandale with an iGFX/Media complex]

Converged Core Architecture
Intel® Xeon® Processor 5500 series based Server platforms
Server Performance comparison to Xeon 5400 Series

Relative performance, higher is better (Xeon 5400 series baseline = 1.00):

• Server Side Java (SPECjbb*2005): 1.64
• Integer (SPECint*_rate_base2006): 1.72
• Energy Efficiency (SPECpower*_ssj2008): 1.74
• Java Apps (SPECjvm*2008): 1.77
• App Server (SPECjApp*Server2004): 1.93
• ERP (SAP-SD* 2-Tier): 2.03
• Floating point (SPECfp*_rate_base2006): 2.28
• Database (TPC*-C): 2.30
• Database (TPC*-E): 2.52
• Web (SPECWeb*2005): 2.54
• Virtualization (VMmark*): 2.66

Source: published/submitted/approved results June 1, 2009. See backup for additional details.

Leadership on key server benchmarks

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm Copyright © 2009, Intel Corporation. * Other names
Intel® Xeon® Processor 5500 series based Server platforms
HPC Performance comparison to Xeon 5400 Series

Relative performance, higher is better (Xeon 5400 series baseline = 1.00):

• Weather (MM5* v4.7.4 - t3a): 1.94
• FEA (LS-DYNA* - 3 Vehicle Collision): 2.12
• FEA (ANSYS* - Distributed): 2.15
• CFD (Star-CD* A-Class): 2.17
• CFD (ANSYS* FLUENT* 12.0 bmk): 2.27
• Energy (CMG IMEX*): 2.39
• OpenMP (SPECompM*base2001): 2.55
• Energy (Eclipse* - 300/2mm): 2.54
• OpenMP (SPECompL*base2001): 2.92
• Weather (WRF* v3.0.1 - 12km CONUS): 2.95
• Energy (Landmark* Nexus): 2.96

Source: published/submitted/approved results March 30, 2009. See backup for additional details.

Exceptional gains on HPC applications
Summary

• Nehalem was Intel's first truly 'converged core'
• Difficult tradeoffs between segments
• Result: outstanding Server, DT, and mobile parts
• Acknowledge the outstanding Architects, Designers, and Validators that made this project a great success
   – A great team overcomes almost all challenges
Intel® Xeon® 5500 Performance Publications

• SPECint*_rate_base2006: 241 score (+72%)
• SPECfp*_rate_base2006: 197 score (+128%)
• SPECpower*_ssj2008: 1977 ssj_ops/watt (+74%), IBM J9* JVM
• SPECjAppServer*2004: 3,975 JOPS (+93%), Oracle WebLogic* Server
• TPC*-C: 631,766 tpmC (+130%), Oracle 11g* database
• SAP-SD* 2-Tier: 5,100 SD Users (+103%), SAP* ERP 6.0/IBM DB2*
• VMmark*: 24.35 @ 17 tiles (+166%), VMware* ESX 4.0
• TPC*-E: 800 tpsE (+152%), Microsoft SQL Server* 2008
• SPECweb*2005: 75,023 score (+150%), Rock Web* Server
• Fluent* 12.0 benchmark: Geo mean of 6 (+127%), ANSYS FLUENT*
• SPECjbb*2005: 604,417 BOPS (+64%), IBM J9* JVM
• SPECapc* for Maya 6.5: 7.70 score (+87%), Autodesk* Maya



  Over 30 New 2S Server and Workstation World Records!
 Percentage gains shown are based on comparison to Xeon 5400 series; Performance results based on published/submitted results as of
  April 27, 2009. Platform configuration details are available at http://www.intel.com/performance/server/xeon/summary.htm *Other
                                       names and brands may be claimed as the property of others

                                                                                              30
Intel® Xeon® Processor 5500 Series based Platforms
ISV Application Performance

                                                                                  Xeon 5500 vs                                                                                                    Xeon 5500 vs
                   ISV Application                                                 Xeon 5400                                       ISV Application                                                 Xeon 5400

   Ansys* CFX11* – CFD Simulation                                                     +88%                   Kingsoft* JXIII Online* – Game Server                                                    +98%
   ERDAS* ERMapper Suite*                                                             +69%                   Mediaware* Instream* – Video Conv.                                                       +73%
   ESI Group* PAM-CRASH*                                                              +50%                   Neowiz* Pmang* – Game Portal                                                             +100%
   ExitGames* Neutron* – Service Platform                                             +80%                   Neusoft* – Healthcare                                                                    +131%
   Epic* – EMR Solution                                                               +82%                   Neusoft* – Telecom BSS VMware*                                                           +115%
   Giant* Juren* – Online Game                                                       +160%                   NHN Corp* Cubrid* – Internet DB                                                          +44%
   IBM* DB2* 9.5 – TPoX XML                                                           +60%                   QlikTech* QlikView* – BI                                                                 +36%
   IBM* Informix* Dynamic Server                                                      +84%                   SAS* Forecast Server*                                                                    +80%
   IBM* solidDB* – In Memory DB                                                       +87%                   SAP* NetWeaver* – BI                                                                     +51%
   Image Analyzer * – Image Scanner                                                  +100%                   SAP* ECC 6.0* – ERP Workload                                                             +71%
   Infowrap* – Small Data Set                                                        +129%                   Schlumberger* Eclipse300*                                                                +213%
   Infowrap* – Weather Forecast                                                      +155%                   SunGard* BancWare Focus ALM*                                                             +38%
   Intense* IECCM* – Output Mgmt.                                                    +150%                   Supcon* – APC Intel. Sensor                                                              +191%
   Intersystems* Cache* – EMR                                                         +63%                   TongTech* – Middleware                                                                   +95%
   Kingdee* APUSIC* – Middleware                                                      +93%                   UFIDA* NC* – ERP Solution                                                                +230%
   Kingdee* EAS* – ERP                                                               +142%                   UFIDA* Online – SaaS Hosting                                                             +237%
   Kingdom* – Stock Transaction                                                      +141%                   Vital Images* – Brain Perfusion 4DCT*                                                    +77%



                 Exceptional gains (1.6x to 3x) on ISV applications

    Source: Results measured and approved by Intel and ISV partners on pre-production platforms. March 30, 2009.
                                                                                                        31
Execution Unit Overview
Unified Reservation Station
• Schedules operations to Execution units
• Single Scheduler for all Execution Units
• Can be used by all integer, all FP, etc.

Execute 6 operations/cycle
• 3 Memory Operations
   • 1 Load
   • 1 Store Address
   • 1 Store Data
• 3 "Computational" Operations


[Diagram: Unified Reservation Station dispatching to six execution ports]
• Port 0: Integer ALU & Shift; FP Multiply; Divide; SSE Integer ALU; Integer Shuffles
• Port 1: Integer ALU & LEA; FP Add; Complex Integer; SSE Integer Multiply
• Port 2: Load
• Port 3: Store Address
• Port 4: Store Data
• Port 5: Integer ALU & Shift; Branch; FP Shuffle; SSE Integer ALU; Integer Shuffles
                                                   32
Increased Parallelism
• Goal: Keep powerful execution engine fed
• Nehalem increases size of out-of-order window by 33%
• Must also increase other corresponding structures

[Chart: Concurrent uOps possible1, growing from Dothan to Merom to Nehalem (scale 0 to 128)]

Structure              Merom    Nehalem   Comment
Reservation Station    32       36        Dispatches operations to execution units
Load Buffers           32       48        Tracks all load operations allocated
Store Buffers          20       32        Tracks all store operations allocated




                 Increased Resources for Higher Performance
 1Intel® Pentium® M processor (formerly Dothan)
 Intel® Core™ microarchitecture (formerly Merom)
 Intel® Core™ microarchitecture (Nehalem)

                                                    33
New TLB Hierarchy

• Problem: Applications continue to grow in data size
• Need to increase TLB size to keep pace for performance
• Nehalem adds new low-latency unified 2nd level TLB

                                       # of Entries
               1st Level Instruction TLBs

               Small Page (4k)         128

               Large Page (2M/4M)      7 per thread

               1st Level Data TLBs

               Small Page (4k)         64

               Large Page (2M/4M)      32

               New 2nd Level Unified TLB

               Small Page Only         512




                                      34
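A quick back-of-envelope on why the new L2 TLB matters: TLB reach is simply entries times page size, so the first-level DTLB sizes in the table above cover very little of a large working set. A minimal sketch (the helper name is illustrative, not an Intel API):

```c
#include <assert.h>
#include <stddef.h>

/* TLB reach = number of entries * page size. With the Nehalem
 * first-level DTLB sizes from the table above, 64 small-page entries
 * cover only 256 KiB, while 32 large-page entries cover 64 MiB --
 * which is why growing data sets push past the L1 TLBs. */
static size_t tlb_reach_bytes(size_t entries, size_t page_bytes) {
    return entries * page_bytes;
}
```
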
Faster Synchronization Primitives
• Multi-threaded software becoming more prevalent
• Scalability of multi-threaded applications can be limited by synchronization
• Synchronization primitives: LOCK prefix, XCHG
• Reduce synchronization latency for legacy software

[Chart: LOCK CMPXCHG performance1, relative latency decreasing from Pentium 4 to Core 2 to Nehalem]
               Greater thread scalability with Nehalem

1Intel® Pentium® 4 processor
Intel® Core™2 Duo processor
Intel® Core™ microarchitecture (Nehalem)-based processor
                                                           35
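The LOCK CMPXCHG latency charted above sits at the heart of primitives like the spinlock below, sketched here with C11 atomics (compare-exchange compiles to LOCK CMPXCHG on x86); the helper names are mine, not from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal test-and-set style spinlock built on compare-exchange.
 * atomic_compare_exchange_weak lowers to LOCK CMPXCHG on x86, the
 * instruction whose round-trip latency Nehalem reduced. */
typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1))
        expected = 0;  /* CAS failed: expected holds the old value; retry */
}

static void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);
}
```
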
Intel® Hyper-Threading Technology
• Also known as Simultaneous Multi-Threading (SMT)
   – Run 2 threads at the same time per core
• Take advantage of 4-wide execution engine
   – Keep it fed with multiple threads
   – Hide latency of a single thread
• Most power efficient performance feature
   – Very low die area cost
   – Can provide significant performance benefit depending on application
   – Much more efficient than adding an entire core
• Intel® Core™ microarchitecture (Nehalem) advantages
   – Larger caches
   – Massive memory BW

[Diagram: execution-unit occupancy over time (proc. cycles) without SMT vs. with SMT; each box represents a processor execution unit]


 Simultaneous multi-threading enhances
    performance and energy efficiency

                                                 36
Intel® Hyper-Threading Technology
                                     • Nehalem is a scalable multi-core architecture
                                     • Hyper-Threading Technology augments benefits
                                          – Power-efficient way to boost performance in all form factors:
                                            higher multi-threaded performance, faster multi-tasking
                                            response

[Diagram: pipeline slots without HT Technology vs. with HT Technology]

Resource                Hyper-Threading           Multi-core
Register State          Replicated                Replicated
Return Stack            Replicated                Replicated
Reorder Buffer          Shared or Partitioned     Replicated
Instruction TLB         Shared or Partitioned     Replicated
Reservation Stations    Shared or Partitioned     Replicated
Cache (L1, L2)          Shared or Partitioned     Replicated
Data TLB                Shared or Partitioned     Replicated
Execution Units         Shared or Partitioned     Replicated

                  • Next generation Hyper-Threading Technology:
                                       – Low-latency pipeline architecture
                                        – Enhanced cache architecture
                                         – Higher memory bandwidth


Enables 8-way processing in Quad Core systems,
      4-way processing in Small Form Factors
 Intel® Microarchitecture codenamed Nehalem           37
Caches
• New 3-level Cache Hierarchy
• 1st level caches
   – 32kB Instruction Cache
   – 32kB, 8-way Data Cache
   – Support more L1 misses in parallel than Intel® Core™2 microarchitecture
• 2nd level Cache
   – New cache introduced in Intel® Core™ microarchitecture (Nehalem)
   – Unified (holds code and data)
   – 256 kB per core (8-way)
   – Performance: Very low latency, 10 cycle load-to-use
   – Scalability: As core count increases, reduces pressure on shared cache

[Diagram: core with a 32kB L1 Data Cache and 32kB L1 Inst. Cache backed by a private 256kB L2 Cache]

                                             38
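Because every level of this hierarchy moves data in 64-byte lines, per-thread data updated by different cores should sit on separate lines. A common padding sketch (the struct name is illustrative):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Pad each counter out to its own 64-byte cache line so two cores
 * incrementing adjacent counters never ping-pong the same line
 * between their private L1/L2 caches (false sharing). 64 bytes is
 * the line size on Nehalem-class parts. */
#define CACHE_LINE 64

typedef struct {
    alignas(CACHE_LINE) unsigned long count;  /* padding comes from alignment */
} per_core_counter;
```
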
3rd Level Cache

• Shared across all cores
• Size depends on # of cores
   – Quad-core: Up to 8MB (16-way)
   – Scalability:
      – Built to vary size with varied core counts
      – Built to easily increase L3 size in future parts
• Perceived latency depends on frequency ratio between core & uncore
• Inclusive cache policy for best performance
   – Address residing in L1/L2 must be present in 3rd level cache

[Diagram: multiple cores, each with private L1 and L2 caches, sharing a single L3 cache]
                                               39
Hardware Prefetching (HWP)

• HW Prefetching critical to hiding memory latency
• Structure of HWPs similar to that in Intel® Core™2 microarchitecture
  – Algorithmic improvements in Intel® Core™ microarchitecture (Nehalem)
    for higher performance
• L1 Prefetchers
  – Based on instruction history and/or load address pattern
• L2 Prefetchers
  – Prefetches loads/RFOs/code fetches based on address pattern
  – Intel Core microarchitecture (Nehalem) changes:
     – Efficient Prefetch mechanism
        – Remove the need for Intel® Xeon® processors to disable HWP
      – Increased prefetcher aggressiveness
         – Locks on address streams quicker, adapts to change faster, issues prefetches more
           aggressively (when appropriate)




                                             40
SW Prefetch Behavior

• PREFETCHT0: Fills L1/L2/L3
• PREFETCHT1/T2: Fills L2/L3
• PREFETCHNTA: Fills L1/L3, L1 LRU is not updated

• SW prefetches can conduct page walks
• SW prefetches can spawn HW prefetches
  – SW prefetch caching behavior not obeyed on HW
    prefetches




                             41
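The PREFETCHT0 behavior above can be sketched with the GCC/Clang `__builtin_prefetch` builtin, which emits a PREFETCHT0-class instruction on x86 for this locality hint; the 64-element prefetch distance is an arbitrary illustration, not a tuned value.

```c
#include <assert.h>
#include <stddef.h>

/* Streaming sum with an explicit software prefetch a fixed distance
 * ahead of the load stream. The hint (0 = read, 3 = keep in all cache
 * levels) corresponds to PREFETCHT0-like behavior. */
static double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 3);
        s += a[i];
    }
    return s;
}
```
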
Memory Bandwidth – Initial Intel® Core™
  Microarchitecture (Nehalem) Products

    •     3 memory channels per socket
     •     ≥ DDR3-1066 at launch
     •     Massive memory BW
     •     Scalability
            – Design IMC and core to take advantage of BW
            – Allow performance to scale with cores
                  – Core enhancements
                       – Support more cache misses per core
                       – Aggressive hardware prefetching w/ throttling enhancements
                  – Example IMC Features
                       – Independent memory channels
                       – Aggressive Request Reordering

[Chart: Stream Bandwidth, MBytes/sec (Triad)1: HTN 3.16/BF1333/667 MHz mem = 6102; HTN 3.00/SB1600/800 MHz mem = 9776; NHM 2.66/6.4 QPI/1066 MHz mem = 33376 (3.4X)]

Source: Intel internal measurements, August 2008



         Massive memory BW provides performance and scalability

1HTN:   Intel® Xeon® processor 5400 Series (Harpertown)
NHM: Intel® Core™ microarchitecture (Nehalem)
                                                                          42
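The Stream (Triad) figures above come from a bandwidth-bound kernel that reads two arrays and writes one; a minimal sketch of that kernel (the real benchmark adds large arrays and timing):

```c
#include <assert.h>
#include <stddef.h>

/* STREAM Triad kernel: a[i] = b[i] + q*c[i]. With arrays far larger
 * than the caches, its throughput is limited by memory bandwidth,
 * which is what the MBytes/sec numbers on this slide measure. */
static void triad(double *a, const double *b, const double *c,
                  double q, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```
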
Memory Latency Comparison
    •     Low memory latency critical to high performance
    •     Design integrated memory controller for low latency
    •     Need to optimize both local and remote memory latency
    •     Intel® Core™ microarchitecture (Nehalem) delivers
            – Huge reduction in local memory latency
            – Even remote memory latency is fast
    • Effective memory latency varies by application/OS
            – Percentage of local vs. remote accesses
            – Intel Core microarchitecture (Nehalem) has lower latency regardless of mix




[Chart: Relative memory latency comparison1: Harpertown (FSB 1600) = 1.00; Nehalem (DDR3-1067) local and remote latencies are both lower]
1Next   generation Quad-Core Intel® Xeon® processor (Harpertown)
Intel® CoreTM microarchitecture (Nehalem)
                                                                                             43
Virtualization
• To get best virtualized performance
  – Have best native performance
  – Reduce:
     – # of transitions into/out of virtual machine
     – Latency of transitions
• Intel® Core™ microarchitecture (Nehalem) virtualization features
  – Reduced latency for transitions
  – Virtual Processor ID (VPID) to reduce effective cost of transitions
  – Extended Page Table (EPT) to reduce # of transitions




       Great virtualization performance with Intel®
           Core™ microarchitecture (Nehalem)


                                        44
Latency of Virtualization Transitions
     • Microarchitectural
           – Huge latency reduction generation over generation
           – Nehalem continues the trend
     • Architectural
           – Virtual Processor ID (VPID) added in Intel® Core™ microarchitecture (Nehalem)
           – Removes need to flush TLBs on transitions

[Chart: Round Trip Virtualization Latency1, relative latency decreasing from Merom to Penryn to Nehalem]




                          Higher Virtualization Performance Through
                                  Lower Transition Latencies

1Intel® Core™ microarchitecture (formerly Merom)
45nm next generation Intel® Core™ microarchitecture (Penryn)
Intel® Core™ microarchitecture (Nehalem)
                                                               45
Extended Page Tables (EPT) Motivation
• A VMM needs to protect physical memory
   – Multiple Guest OSs share the same physical memory
   – Protections are implemented through page-table virtualization
• Page table virtualization accounts for a significant portion of virtualization overheads
   – VM Exits / Entries
• The goal of EPT is to reduce these overheads

[Diagram: the Guest OS in VM1 references its guest page table via CR3; guest page table changes cause exits into the VMM, which maintains the active page table used by the CPU]




                                                   46
EPT Solution
[Diagram: Guest Linear Address → Intel® 64 Page Tables (referenced by guest CR3) → Guest Physical Address → EPT Page Tables (referenced by EPT Base Pointer) → Host Physical Address]

 • Intel® 64 Page Tables
    – Map Guest Linear Address to Guest Physical Address
    – Can be read and written by the guest OS
 • New EPT Page Tables under VMM Control
    – Map Guest Physical Address to Host Physical Address
    – Referenced by new EPT base pointer
 • No VM Exits due to Page Faults, INVLPG or CR3 accesses




                                  47
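The two-stage walk above can be sketched with flat toy tables indexed by frame number (real page tables are multi-level; all names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Toy two-stage address translation with 4 KiB pages. Stage 1 models
 * the guest's Intel 64 page tables (guest-linear -> guest-physical);
 * stage 2 models the EPT (guest-physical -> host-physical). Both are
 * flattened to one-level arrays of frame numbers for illustration. */
#define PAGE_SHIFT 12
#define PAGE_OFFSET_MASK 0xFFFu

static uint64_t translate(const uint64_t *guest_pt, const uint64_t *ept,
                          uint64_t guest_linear) {
    uint64_t offset      = guest_linear & PAGE_OFFSET_MASK;
    uint64_t guest_frame = guest_pt[guest_linear >> PAGE_SHIFT]; /* stage 1 */
    uint64_t host_frame  = ept[guest_frame];                     /* stage 2 */
    return (host_frame << PAGE_SHIFT) | offset;
}
```
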
Intel® Core™ Microarchitecture
(Nehalem-EP) Platform Architecture
• Integrated Memory Controller
   – 3 DDR3 channels per socket
   – Massive memory bandwidth
   – Memory bandwidth scales with # of processors
   – Very low memory latency
• Intel® QuickPath Interconnect (Intel® QPI)
   – New point-to-point interconnect
   – Socket to socket connections
   – Socket to chipset connections
   – Build scalable solutions

[Diagram: two Nehalem-EP sockets connected to each other and to the Tylersburg-EP I/O hub via Intel® QPI links]
              Significant performance leap from new platform


 Intel® Core™ microarchitecture (Nehalem-EP)
 Intel® Next Generation Server Processor Technology (Tylersburg-EP)

                                                                      48
Intel Next-Generation Mainstream Processors1
         Feature                       Core™ i7              Lynnfield             Clarkdale Clarksfield               Arrandale
Processing Threads
[via Intel® Hyper-Threading                   8                    Up to 8           Up to 4          Up to 8            Up to 4
Technology (HT)]

Processor Cores                               4                      4                 2                    4                2

Shared Cache                                8MB                Up to 8MB           Up to 4MB         Up to 8MB          Up to 4MB
Integrated Memory
                                       3 ch. DDR3                                              2 ch. DDR3
Controller Channels
DDR Freq Support
                                        800, 1066                                   1066, 1333                          800, 1066
(sku dependent)

# DIMMs/Channels                              2                                2                                   1

                                    2x16 or 4x8, 1x4
PCI Express* 2.0                                              1x16 or 2x8          1x16 or 2x8      1x16 or 2x8        1x16 (1.0)
                                       (via X58)
Processor Graphics                           No                      No                Yes              No                  Yes
Processor Package TDP                      130W                     95W               73W          55W and 45W      35W, 25W, 18W

Socket                                   LGA 1366                          LGA 1156                             rPGA, BGA

Platform Support                      X58 & ICH10                                     Intel® 5 series Chipset
Processor Core Process
                                           45nm                    45nm              32nm              45nm                 32nm
Technology

  Bringing Intel® Core™ i7 Benefits into Mainstream
      1Not   all features are on all products, subject to change          49

Intel's Nehalem Microarchitecture by Glenn Hinton

  • 1. Key Nehalem Choices
    Glenn Hinton, Intel Fellow, Nehalem Lead Architect
    Feb 17, 2010
  • 2. Outline
    • Review NHM timelines and overall issues
    • Converged core tensions
    • Big debate – wider vectors vs SMT vs more cores
    • How decided features
    • Power, Power, Power
    • Summary
  • 3. Tick-Tock Development Model: Pipelined developments – 5+ year projects
    Merom (NEW microarchitecture, 65nm, TOCK) → Penryn (NEW process, 45nm, TICK) → Nehalem (NEW microarchitecture, 45nm, TOCK) → Westmere (NEW process, 32nm, TICK) → Sandy Bridge (NEW microarchitecture, 32nm, TOCK) [Forecast]
    • Merom started in 2001, released 2006
    • Nehalem started in 2003 (with research even earlier)
    • Sandy Bridge started in 2005
    • Clearly already working on new Tock after Sandy Bridge
    • Most of Nehalem uArch was decided by mid 2004
      – But most detailed engineering work happened in 2005/06/07
    Footnotes: Intel® Core™ microarchitecture (formerly Merom); 45nm next generation Intel® Core™ microarchitecture (Penryn); Intel® Core™ Microarchitecture (Nehalem); Intel® Microarchitecture (Westmere); Intel® Microarchitecture (Sandy Bridge). All dates, product descriptions, availability and plans are forecasts and subject to change without notice.
  • 4. Nehalem: Lots of competing aspects
    [diagram: a cloud of competing requirements spanning consumer desktop, mobile, and server – Learn, Play, Enjoy, Health & Wellness, Collaboration, Media Processing, Graphics Processing, Manageability & Security, Multi-core, Power/Perf Efficiency, SMCA, 4/3/2/1 Core Scalability, Seamless Wireless / Cable-free computing, *T’s, Serviceability, TPCC Perf Increase, Power, Battery Life, RAS features, Technologies, Reliability, Threading, Connectivity, Reduce Form Factor, Availability, Anywhere/anytime computing]
  • 5. The Blind Men and the Elephant
    It was six men of Indostan
    To learning much inclined,
    Who went to see the Elephant
    (Though all of them were blind),
    That each by observation
    Might satisfy his mind …
    … And so these men of Indostan
    Disputed loud and long,
    Each in his own opinion
    Exceeding stiff and strong,
    Though each was partly in the right,
    And all were in the wrong!
    – by John Godfrey Saxe
  • 6. “Converged Core” tradeoffs
    Common CPU core for multiple uses:
    • Mobile (laptops)
    • Desktop
    • Server/HPC
    • Workloads?
    • How tradeoff?
  • 7. “Converged Core” tradeoffs
    • Mobile
      – 1/2/4 core options; scalable caches
      – Low TDP power CPU and GPU
      – Very low “average” power (great partial sleep state power)
      – Very low sleep state power
      – Low V-min to give best power efficiency when active
      – Moderate DRAM bandwidth at low power
      – Very dynamic power management
      – Low cost for volume
      – Great single threaded performance
        – Most apps single threaded
    • Desktop
    • Server
    • Workloads
      – Productivity, Media, ISPEC, FSPEC, 32 vs 64 bit
    • How tradeoff?
  • 8. “Converged Core” tradeoffs
    • Mobile
    • Desktop
      – 1/2/4 core options; scalable caches
      – Media processing, high end game performance
      – Moderate DRAM bandwidth
      – Low cost for volume
      – Great single threaded performance
        – Most apps single threaded
    • Server
    • Workloads
      – Productivity, Media, Games, ISPEC, FSPEC, 32 vs 64 bit
    • How tradeoff?
  • 9. “Converged Core” tradeoffs
    • Mobile
    • Desktop
    • Server
      – More physical address bits (speed paths, area, power)
      – More RAS features (ECC on caches, TLBs, etc)
      – Larger caches, TLBs, BTBs, multi-socket snoop, etc
      – Fast locks and multi-threaded optimizations
      – More DRAM channels (BW and capacity) & more external links
      – Dynamic power management
      – Many cores (4, 8, etc) so need low power per core
      – SMT gives large perf gain since threaded apps
      – Low V-min to allow many cores to fit in low blade power envelopes
    • Workloads
      – Workstation, Server, ISPEC, FSPEC, 64 bit
    • How tradeoff?
  • 10. “Converged Core” tradeoffs
    • Mobile
      – 1/2/4 core options; scalable caches
      – Low TDP power CPU and GPU
      – Very low “average” power (great partial sleep state power)
      – Very low sleep state power
      – Low V-min to give best power efficiency when active
      – Moderate DRAM bandwidth at low power
      – Very dynamic power management
      – Low cost for volume
      – Great single threaded performance (most apps single threaded)
    • Desktop
      – 1/2/4 core options; scalable caches
      – Media processing, high end game performance
      – Moderate DRAM bandwidth
      – Low cost for volume
      – Great single threaded performance (most apps single threaded)
    • Server
      – More physical address bits (speed paths, area, power)
      – More RAS features (ECC on caches, TLBs, etc)
      – Larger caches, TLBs, BTBs, multi-socket snoop, etc
      – Fast locks and multi-threaded optimizations
      – More DRAM channels (BW and capacity) and more external links
      – Dynamic power management
      – Many cores (4, 8, etc) so need low power per core
      – SMT gives large perf gain since threaded apps
      – Low V-min to allow many cores to fit in low blade power envelopes
    • Workloads
      – Productivity, Media, Workstation, Server, ISPEC, FSPEC, 32 vs 64 bit, etc
    • How tradeoff?
  • 11. Early Base Core Selection – 2003
    • Goals
      – Best single threaded perf?
      – Lowest cost dual core?
      – Lowest power dual core?
      – Best laptop battery life?
      – Most cores that fit in server size?
      – Best total throughput for cost/power in multi-core?
      – Least engineering costs?
    • Major options
      – Enhanced Northwood (P4) pipeline?
      – Enhanced P6 pipeline (like Merom – Core 2 Duo)?
      – New from-scratch pipeline?
    • Why went with enhanced P6 (Merom) pipeline?
      – Lower power per core, lower per-core die size, lower total effort
      – Better SW optimization consistency
    • Likely gave up some ST perf (10-20%?)
      – But unlikely to have been able to do ‘bigger’ alternatives
  • 12. 2004 Major Decision: Cores vs Vectors vs SMT
    • Just use older Penryn cores and have 3 or 4 of them?
      – No single threaded performance gains
    • Put in wider vectors (like recently announced AVX)?
      – 256-bit wide vectors (called VSSE back then)
      – Very power and area efficient, if doing wide vectors
      – Consumes die size and power when not using
    • Add SMT per core and have fewer cores?
      – Very power and area efficient
      – Adds a lot of complexity; some die cost and power when not using
    • What do servers want? Lots of cores/threads
    • Laptops? Low power cores
    • HE Desktops? Great media/game performance
    • Options with similar die cost:
      – 2 enhanced cores + SMT + wide vectors?
      – 3 enhanced cores + SMT?
      – 4 simplified cores?
  • 13. 2 Cores vs 4 Cores pros/cons
    • 2 Cores + VSSE + SMT
      – Somewhat smaller die size
      – Lower power than 4 core
      – Better ST perf
      – Best media if use VSSE?
        – Not clear – looks like a wash
      – Specialized MIPS sometimes unused
      – VSSE gives perf for apps not easily threaded
      – Is threading really mainly for wizards?
      – New visible ISA feature like MMX, SSE
    • 4 Cores
      – Better generic 4T perf
        – TPCC, multi-tasking
      – Best media perf on legacy 4T-enabled apps
      – Simpler HW design
      – Die size somewhat bigger
      – Simpler/harder SW enabling
      – Simpler since no VSSE
      – No SMT asymmetries
      – Harder since general 4T
      – 4T perf is also specialized
        – But probably less than VSSE
      – TDP power somewhat higher
      – Somewhat lower average power
        – Since smaller single core
      – More granular to hit finer segments (1/2/3/4 core options)
      – More complex power management
      – Trade uArch change resources for power reduction
• 14. Nehalem Direction
  • Tech Readiness Direction
    – 4/3/2/1 cores supported
    – VSSE dropped
      – Saves power and die size
    – SMT maintained
  • VSSE in core less valued by servers and mobile
    – Casualty of Converged Core
    – Utilize scaled 4 core solution to recover media perf
  • SMT to increase threaded perf
    – Initially target servers
  • Spend more effort to reduce power
  [Charts: SPECrate and DH/Media performance vs. core power for the
   2 Cores + VSSE + SMT, 3 Cores + SMT, 3 Cores, and 4 Cores options]
• 15. Early Goal: Remove the Multi-Core Perf Tax (in power constrained usages)
  • Early multi-cores ran at lower freq than single core variants
  • When Nehalem started, dual cores were still 2-3 years away…
    – Wanted 4 cores in high-end volume systems – a big increase
  • Lower power envelopes planned for all Nehalem usages
    – Thinner laptops, blade servers, small-form-factor desktops, etc
  • Many apps still single threaded
  • All cores together can draw a lot of power – limits highest TDP freq
    – If just one core were active, the highest frequency could be a lot higher
  • Turbo Mode/Power Gates removed this multi-core tax
    – One of the biggest ST perf gains for mobile/power constrained usages
  • Considered ways to do Turbo Mode
    – Decided we must have a flexible means to tune late in the project
    – Added PCU with micro-controller to dynamically adapt
• 16. Power Control Unit
  • Integrated proprietary microcontroller
  • Shifts control from hardware to embedded firmware
  • Real time sensors for temperature, current, power
  • Flexibility enables sophisticated algorithms, tuned for current operating conditions
  [Diagram: PCU connected to per-core Vcc/Freq PLLs and sensors and to the
   uncore/LLC domain, driven by BCLK]
• 17. Intel® Core™ Microarchitecture (Nehalem) Turbo Mode
  • Power Gating: zero power for inactive cores (C6 state)
  [Diagram: with no Turbo, all four cores run at frequency F; workload is
   lightly threaded or < TDP]
• 18. Intel® Core™ Microarchitecture (Nehalem) Turbo Mode
  • Power Gating: zero power for inactive cores
  • Turbo Mode: in response to workload, adds additional performance bins
    within headroom
  [Diagram: cores 2 and 3 power-gated; cores 0 and 1 run above frequency F
   when the workload is lightly threaded or < TDP]
• 20. uArch Features – Converged Core
  • Difficult balancing the needs of the 3 conflicting segments
  • All CPU core features must be very power efficient
    – Helps all segments, especially laptops and multi-core servers
    – Requirement was to beat a 2:1 power/perf ratio
    – Ended up more like 1.3:1 power/perf ratio for perf features added
  • Segment specific features can't add much power or die size
  • Initial Uncore optimized for 4 core DP server but had to scale down
    well to 2 core volume part
  • Many things mobile wanted also helped servers
    – Low power cores, lower die size per core, etc
    – Active power management
    – More synergy than originally thought
• 21. Nehalem Power Efficiency Features
  • Only adding power efficient uArch features
    – Net power : performance ratio of Nehalem core ~1.3 : 1
    – Far better than voltage scaling
  • Reducing min operating voltage
    – Cubic power reduction with ~linear perf reduction
  • Implementing C6/Power Gated low-power state
    – Provides significant reduction in average mobile power
  • Turbo mode
    – Allows processor to utilize entire available power envelope
    – Reduces performance penalty from multi-core on ST apps
  [Chart: approximate relative core power vs. relative performance curves
   for MeromC, Penryn, and NHM cores, normalized to MeromC @ 1.1V, with
   linear frequency decline]
• 22. Designed for Performance
  • New SSE4.2 instructions
  • Improved lock support
  • Additional caching hierarchy
  • Deeper buffers
  • Improved loop streaming
  • Simultaneous multi-threading
  • Faster virtualization
  • Better branch prediction
  [Diagram: features overlaid on the pipeline – instruction fetch & L1
   cache, decode & microcode, out-of-order scheduling & retirement,
   execution units, memory ordering & execution, paging, L1 data cache,
   L2 cache & interrupt servicing]
• 23. The First Intel® Core™ Microarchitecture (Nehalem) Processor
  [Die diagram: four cores sharing an L3 cache, integrated memory
   controller, queue, misc IO, and two Intel® QuickPath Interconnect
   (Intel® QPI) links]
  • A Modular Design for Flexibility
• 24. Scalable Cores
  • Same core for all segments
  • Common software optimization
  • Common feature set
  • Intel® Core™ Microarchitecture (Nehalem), 45nm
    – Servers/Workstations: energy efficiency, performance, virtualization,
      reliability, capacity, scalability
    – Desktop: performance, graphics, energy efficiency, idle power,
      security
    – Mobile: battery life, performance, energy efficiency, graphics,
      security
  • Optimized cores to meet multiple market segments
• 25. Modularity
  [Die diagrams: 45nm Nehalem Core i7 (4 cores, shared L3 cache, memory
   controller, queue, QPI, misc IO); 45nm Lynnfield/Clarksfield; 32nm
   Clarkdale/Arrandale with iGFX/Media complex]
  • Converged Core Architecture
• 26. Intel® Xeon® Processor 5500 Series based Server Platforms
  • Server performance comparison to Xeon 5400 series (relative
    performance, higher is better; Xeon 5400 series baseline = 1.00)
  • Gains of 1.64x to 2.66x across key server benchmarks: energy
    efficiency (SPECpower*_ssj2008), Java (SPECjbb*2005, SPECjvm*2008,
    SPECjApp*Server2004), integer and floating point throughput
    (SPECint*_rate_base2006, SPECfp*_rate_base2006), virtualization
    (VMmark*), ERP (SAP-SD* 2-Tier), database (TPC*-C, TPC*-E), and web
    (SPECWeb*2005)
  • Source: published/submitted/approved results June 1, 2009. See backup
    for additional details
  • Leadership on key server benchmarks
  Performance tests and ratings are measured using specific computer
  systems and/or components and reflect the approximate performance of
  Intel products as measured by those tests. Any difference in system
  hardware or software design or configuration may affect actual
  performance. Buyers should consult other sources of information to
  evaluate the performance of systems or components they are considering
  purchasing. For more information on performance tests and on the
  performance of Intel products, visit
  http://www.intel.com/performance/resources/limits.htm
  Copyright © 2009, Intel Corporation. * Other names and brands may be
  claimed as the property of others.
• 27. Intel® Xeon® Processor 5500 Series based Server Platforms
  • HPC performance comparison to Xeon 5400 series (relative performance,
    higher is better; Xeon 5400 series baseline = 1.00)
  • Gains of 1.94x to 2.96x on HPC benchmarks: weather (MM5* v4.7.4, WRF*
    v3.0.1 12km CONUS), FEA (LS-DYNA* 3-vehicle collision, ANSYS*
    Distributed), CFD (Star-CD* A-Class, ANSYS* FLUENT* 12.0 300/2mm bmk),
    energy (CMG IMEX*, Landmark* Nexus, Schlumberger Eclipse* t3a), and
    OpenMP (SPECompM*base2001, SPECompL*base2001)
  • Source: published/submitted/approved results March 30, 2009. See
    backup for additional details
  • Exceptional gains on HPC applications
• 28. Summary
  • Nehalem was Intel's first truly 'converged core'
  • Difficult tradeoffs between segments
  • Result: outstanding Server, DT, and mobile parts
  • Acknowledge the outstanding Architects, Designers, and Validators that
    made this project a great success
    – A great team overcomes almost all challenges
• 30. Intel® Xeon® 5500 Performance Publications
  – SPECint*_rate_base2006: 241 score (+72%)
  – SPECpower*_ssj2008: 1,977 ssj_ops/watt (+74%) – IBM J9* JVM
  – SPECfp*_rate_base2006: 197 score (+128%)
  – SPECjAppServer*2004: 3,975 JOPS (+93%) – Oracle WebLogic* Server
  – TPC*-C: 631,766 tpmC (+130%) – Oracle 11g* database
  – SAP-SD* 2-Tier: 5,100 SD Users (+103%) – SAP* ERP 6.0/IBM DB2*
  – VMmark*: 24.35 @ 17 tiles (+166%) – VMware* ESX 4.0
  – TPC*-E: 800 tpsE (+152%) – Microsoft SQL Server* 2008
  – SPECWeb*2005: 75,023 score (+150%) – Rock Web* Server
  – Fluent* 12.0 benchmark: geo mean of 6 (+127%) – ANSYS FLUENT*
  – SPECjbb*2005: 604,417 BOPS (+64%) – IBM J9* JVM
  – SPECapc* for Maya 6.5: 7.70 score (+87%) – Autodesk* Maya
  • Over 30 new 2S server and workstation world records!
  • Percentage gains shown are based on comparison to Xeon 5400 series;
    performance results based on published/submitted results as of April
    27, 2009. Platform configuration details are available at
    http://www.intel.com/performance/server/xeon/summary.htm
• 31. Intel® Xeon® Processor 5500 Series based Platforms – ISV Application
  Performance (Xeon 5500 vs Xeon 5400)
  – Ansys* CFX11* – CFD Simulation: +88%
  – ERDAS* ERMapper Suite*: +69%
  – ESI Group* PAM-CRASH*: +50%
  – ExitGames* Neutron* – Service Platform: +80%
  – Epic* – EMR Solution: +82%
  – Giant* Juren* – Online Game: +160%
  – IBM* DB2* 9.5 – TPoX XML: +60%
  – IBM* Informix* Dynamic Server: +84%
  – IBM* solidDB* – In Memory DB: +87%
  – Image Analyzer* – Image Scanner: +100%
  – Infowrap* – Small Data Set: +129%
  – Infowrap* – Weather Forecast: +155%
  – Intense* IECCM* – Output Mgmt.: +150%
  – Intersystems* Cache* – EMR: +63%
  – Kingdee* APUSIC* – Middleware: +93%
  – Kingdee* EAS* – ERP: +142%
  – Kingdom* – Stock Transaction: +141%
  – Kingsoft* JXIII Online* – Game Server: +98%
  – Mediaware* Instream* – Video Conv.: +73%
  – Neowiz* Pmang* – Game Portal: +100%
  – Neusoft* – Healthcare: +131%
  – Neusoft* – Telecom BSS VMware*: +115%
  – NHN Corp* Cubrid* – Internet DB: +44%
  – QlikTech* QlikView* – BI: +36%
  – SAS* Forecast Server*: +80%
  – SAP* NetWeaver* – BI: +51%
  – SAP* ECC 6.0* – ERP Workload: +71%
  – Schlumberger* Eclipse300*: +213%
  – SunGard* BancWare Focus ALM*: +38%
  – Supcon* – APC Intel. Sensor: +191%
  – TongTech* – Middleware: +95%
  – UFIDA* NC* – ERP Solution: +230%
  – UFIDA* Online – SaaS Hosting: +237%
  – Vital Images* – Brain Perfusion 4DCT*: +77%
  • Exceptional gains (1.6x to 3x) on ISV applications
  • Source: results measured and approved by Intel and ISV partners on
    pre-production platforms, March 30, 2009
• 32. Execution Unit Overview
  • Execute 6 operations/cycle
    – 3 memory operations: 1 load, 1 store address, 1 store data
    – 3 "computational" operations
  • Unified Reservation Station
    – Schedules operations to execution units
    – Single scheduler for all execution units
    – Can be used by all integer, all FP, etc.
  • Issue ports:
    – Port 0: Integer ALU & Shift; FP Multiply; Divide; SSE Integer ALU;
      Integer Shuffles
    – Port 1: Integer ALU & LEA; FP Add; Complex Integer; SSE Integer
      Multiply
    – Port 2: Load
    – Port 3: Store Address
    – Port 4: Store Data
    – Port 5: Integer ALU & Shift; Branch; FP Shuffle; SSE Integer ALU;
      Integer Shuffles
• 33. Increased Parallelism
  • Goal: keep the powerful execution engine fed
  • Nehalem increases the size of the out-of-order window by 33%
  • Must also increase other corresponding structures:

    Structure            Merom¹   Nehalem   Comment
    Reservation Station  32       36        Dispatches operations to execution units
    Load Buffers         32       48        Tracks all load operations allocated
    Store Buffers        20       32        Tracks all store operations allocated

  [Chart: concurrent uOps possible for Dothan¹, Merom, Nehalem]
  • Increased resources for higher performance
  ¹Intel® Pentium® M processor (formerly Dothan); Intel® Core™
   microarchitecture (formerly Merom)
• 34. New TLB Hierarchy
  • Problem: applications continue to grow in data size
  • Need to increase TLB size to keep pace for performance
  • Nehalem adds a new low-latency unified 2nd level TLB

    TLB                         # of Entries
    1st level Instruction TLB   Small page (4k): 128; Large page (2M/4M): 7 per thread
    1st level Data TLB          Small page (4k): 64; Large page (2M/4M): 32
    New 2nd level unified TLB   Small page only: 512
• 35. Faster Synchronization Primitives
  • Multi-threaded software becoming more prevalent
  • Scalability of multi-threaded applications can be limited by
    synchronization
  • Synchronization primitives: LOCK prefix, XCHG
  • Reduce synchronization latency for legacy software
  [Chart: relative LOCK CMPXCHG latency for Pentium 4¹, Core 2, and
   Nehalem; Nehalem is lowest]
  • Greater thread scalability with Nehalem
  ¹Intel® Pentium® 4 processor; Intel® Core™2 Duo processor; Intel® Core™
   microarchitecture (Nehalem)-based processor
• 36. Intel® Hyper-Threading Technology
  • Also known as Simultaneous Multi-Threading (SMT)
    – Run 2 threads at the same time per core
  • Take advantage of 4-wide execution engine
    – Keep it fed with multiple threads
    – Hide latency of a single thread
  • Most power efficient performance feature
    – Very low die area cost
    – Can provide significant performance benefit depending on application
    – Much more efficient than adding an entire core
  • Intel® Core™ microarchitecture (Nehalem) advantages
    – Larger caches
    – Massive memory BW
  [Diagram: execution unit occupancy per cycle without SMT vs. with SMT;
   each box represents a processor execution unit]
  • Simultaneous multi-threading enhances performance and energy efficiency
• 37. Intel® Hyper-Threading Technology
  • Nehalem is a scalable multi-core architecture
  • Hyper-Threading Technology augments its benefits
    – Power-efficient way to boost performance in all form factors: higher
      multi-threaded performance, faster multi-tasking response
  • Per-thread resource handling:
    – Replicated: register state, return stack
    – Shared or partitioned: reorder buffer, instruction TLB, reservation
      stations, caches (L1, L2), data TLB, execution units
  • Next generation Hyper-Threading Technology:
    – Low-latency pipeline architecture
    – Enhanced cache architecture
    – Higher memory bandwidth
  • Enables 8-way processing in quad core systems, 4-way processing in
    small form factors
• 38. Caches
  • New 3-level cache hierarchy
  • 1st level caches
    – 32kB Instruction Cache
    – 32kB, 8-way Data Cache
    – Support more L1 misses in parallel than Intel® Core™2
      microarchitecture
  • 2nd level cache
    – New cache introduced in Intel® Core™ microarchitecture (Nehalem)
    – Unified (holds code and data)
    – 256 kB per core (8-way)
    – Performance: very low latency – 10 cycle load-to-use
    – Scalability: as core count increases, reduces pressure on the
      shared cache
• 39. 3rd Level Cache
  • Shared across all cores
  • Size depends on # of cores
    – Quad-core: up to 8MB (16-way)
    – Scalability: built to vary size with varied core counts, and to
      easily increase L3 size in future parts
  • Perceived latency depends on the frequency ratio between core & uncore
  • Inclusive cache policy for best performance
    – An address residing in L1/L2 must be present in the 3rd level cache
• 40. Hardware Prefetching (HWP)
  • HW prefetching critical to hiding memory latency
  • Structure of HWPs similar to Intel® Core™2 microarchitecture
    – Algorithmic improvements in Intel® Core™ microarchitecture (Nehalem)
      for higher performance
  • L1 prefetchers
    – Based on instruction history and/or load address pattern
  • L2 prefetchers
    – Prefetch loads/RFOs/code fetches based on address pattern
    – Intel Core microarchitecture (Nehalem) changes:
      – More efficient prefetch mechanism
        – Removes the need for Intel® Xeon® processors to disable HWP
      – Increased prefetcher aggressiveness
        – Locks on to address streams quicker, adapts to change faster,
          issues prefetches more aggressively (when appropriate)
• 41. SW Prefetch Behavior
  • PREFETCHT0: fills L1/L2/L3
  • PREFETCHT1/T2: fills L2/L3
  • PREFETCHNTA: fills L1/L3, L1 LRU is not updated
  • SW prefetches can conduct page walks
  • SW prefetches can spawn HW prefetches
    – SW prefetch caching behavior is not obeyed on the spawned HW
      prefetches
• 42. Memory Bandwidth – Initial Intel® Core™ Microarchitecture (Nehalem)
  Products
  • 3 memory channels per socket
  • ≥ DDR3-1066 at launch
  • Massive memory BW
  • Scalability
    – Design IMC and core to take advantage of BW
    – Allow performance to scale with cores
    – Core enhancements
      – Support more cache misses per core
      – Aggressive hardware prefetching w/ throttling enhancements
    – Example IMC features
      – Independent memory channels
      – Aggressive request reordering
  • Stream Triad bandwidth (MBytes/sec), a ~3.4X gain:
    – HTN 3.16 / BF1333 / 667 MHz mem: 6,102
    – HTN 3.00 / SB1600 / 800 MHz mem: 9,776
    – NHM 2.66 / 6.4 QPI / 1066 MHz mem: 33,376
  • Source: Intel internal measurements – August 2008
  • Massive memory BW provides performance and scalability
  HTN: Intel® Xeon® processor 5400 Series (Harpertown)
  NHM: Intel® Core™ microarchitecture (Nehalem)
• 43. Memory Latency Comparison
  • Low memory latency critical to high performance
  • Design integrated memory controller for low latency
  • Need to optimize both local and remote memory latency
  • Intel® Core™ microarchitecture (Nehalem) delivers
    – Huge reduction in local memory latency
    – Even remote memory latency is fast
  • Effective memory latency varies per application/OS
    – Percentage of local vs. remote accesses
    – Intel Core microarchitecture (Nehalem) has lower latency regardless
      of mix
  [Chart: relative memory latency – Harpertown¹ (FSB 1600), Nehalem
   (DDR3-1067) local, Nehalem (DDR3-1067) remote]
  ¹Quad-Core Intel® Xeon® processor (Harpertown); Intel® Core™
   microarchitecture (Nehalem)
• 44. Virtualization
  • To get the best virtualized performance
    – Have the best native performance
    – Reduce:
      – # of transitions into/out of the virtual machine
      – Latency of transitions
  • Intel® Core™ microarchitecture (Nehalem) virtualization features
    – Reduced latency for transitions
    – Virtual Processor ID (VPID) to reduce effective cost of transitions
    – Extended Page Tables (EPT) to reduce # of transitions
  • Great virtualization performance with Intel® Core™ microarchitecture
    (Nehalem)
• 45. Latency of Virtualization Transitions
  • Microarchitectural
    – Huge latency reduction generation over generation
    – Nehalem continues the trend
  • Architectural
    – Virtual Processor ID (VPID) added in Intel® Core™ microarchitecture
      (Nehalem)
    – Removes the need to flush TLBs on transitions
  [Chart: relative round trip virtualization latency – Merom¹, Penryn,
   Nehalem]
  • Higher virtualization performance through lower transition latencies
  ¹Intel® Core™ microarchitecture (formerly Merom); 45nm next generation
   Intel® Core™ microarchitecture (Penryn); Intel® Core™ microarchitecture
   (Nehalem)
• 46. Extended Page Tables (EPT) Motivation
  • A VMM needs to protect physical memory
    – Multiple guest OSs share the same physical memory
    – Protections are implemented through page-table virtualization
  • Page table virtualization accounts for a significant portion of
    virtualization overheads
    – Guest page table changes cause exits into the VMM
    – The VMM maintains the active page table, which is used by the CPU
    – VM exits / entries
  • The goal of EPT is to reduce these overheads
  [Diagram: guest OS CR3 points to the guest page table inside VM1; the
   VMM's CR3 points to the active page table it maintains]
• 47. EPT Solution
  • Intel® 64 page tables
    – Map guest linear address to guest physical address
    – Can be read and written by the guest OS
  • New EPT page tables under VMM control
    – Map guest physical address to host physical address
    – Referenced by a new EPT base pointer
  • No VM exits due to page faults, INVLPG, or CR3 accesses
  [Diagram: guest linear address → (Intel® 64 page tables) → guest
   physical address → (EPT page tables, EPT base pointer) → host physical
   address]
• 48. Intel® Core™ Microarchitecture (Nehalem-EP) Platform Architecture
  • Integrated memory controller
    – 3 DDR3 channels per socket
    – Massive memory bandwidth
    – Memory bandwidth scales with # of processors
    – Very low memory latency
  • Intel® QuickPath Interconnect (Intel® QPI)
    – New point-to-point interconnect
    – Socket to socket connections
    – Socket to chipset connections
    – Build scalable solutions
  [Diagram: two Nehalem-EP sockets linked by QPI to each other and to the
   Tylersburg-EP IOH]
  • Significant performance leap from the new platform
  Intel® Core™ microarchitecture (Nehalem-EP); Intel® Next Generation
  Server Processor Technology (Tylersburg-EP)
• 49. Intel Next-Generation Mainstream Processors¹

  Feature               Core™ i7        Lynnfield      Clarkdale     Clarksfield     Arrandale
  Processing threads    8               Up to 8        Up to 4       Up to 8         Up to 4
  (via Intel® HT)
  Processor cores       4               4              2             4               2
  Shared cache          8MB             Up to 8MB      Up to 4MB     Up to 8MB       Up to 4MB
  IMC channels          3 ch. DDR3      2 ch. DDR3     2 ch. DDR3    2 ch. DDR3      2 ch. DDR3
  DDR freq support      800, 1066       1066, 1333     1066, 1333    800, 1066       800, 1066
  (sku dependent)
  # DIMMs/channel       2               2              2             1               1
  PCI Express* 2.0      2x16 or 4x8,    1x16 or 2x8    1x16 or 2x8   1x16 or 2x8     1x16 (1.0)
                        1x4 (via X58)
  Processor graphics    No              No             Yes           No              Yes
  Package TDP           130W            95W            73W           55W and 45W     35W, 25W, 18W
  Socket                LGA 1366        LGA 1156       LGA 1156      rPGA, BGA       rPGA, BGA
  Platform support      X58 & ICH10     Intel® 5 series chipset
  Core process tech     45nm            45nm           32nm          45nm            32nm

  • Bringing Intel® Core™ i7 benefits into the mainstream
  ¹Not all features are on all products; subject to change