Hot chips24 fujitsu_presentation



  1. 1. SPARC64™ X: Fujitsu’s New Generation 16 Core Processor for the next generation UNIX servers
        August 29, 2012
        Takumi Maruyama, Processor Development Division, Enterprise Server Business Unit, Fujitsu Limited
        All Rights Reserved, Copyright © FUJITSU LIMITED 2012
  2. 2. Agenda
        • Fujitsu Processor Development History
        • SPARC64™ X
          – Design concept
          – SWoC (Software on Chip)
          – Processor chip overview
          – µ-Architecture
          – Performance
        • Summary
  3. 3. Fujitsu Processor Development
        [Figure: timeline of Fujitsu processor generations across ~1999, 2000~2003, 2004~2007, 2008~2011, and 2012~, each chip annotated with transistor count (Tr) and process technology. Most per-chip labels cannot be recovered from the transcript; SPARC64 X is Tr=2.95B, CMOS Cu 28nm.]
        • SPARC64 processors shown: SPARC64, SPARC64 II, SPARC64 GP, SPARC64 V, SPARC64 V+, SPARC64 VI, SPARC64 VII, SPARC64 VIIIfx, SPARC64 IXfx, SPARC64 X
        • Mainframe processors shown: GS8600, GS8800, GS8800B, GS8900, GS21 600, GS21 900, GS21
        • Technologies labelled on the chart, under the headings "High Performance Technology" and "High Reliability Technology": Virtual Machine Architecture, Software On Chip, HPC-ACE, System On Chip, Hardware Barrier, Multi-core Multi-thread, L2$ on Die, Non-Blocking $, O-O-O Execution, Super-Scalar, Single-chip CPU, Store Ahead, Branch History, Prefetch, $ ECC, Register/ALU Parity, Instruction Retry, $ Dynamic Degradation, RC/RT/History
  4. 4. SPARC64™ X Design Concept
        • Combine UNIX and HPC Fujitsu processor features to realize an extremely high throughput UNIX processor.
          – SPARC64 VII/VII+ (UNIX processor) features
            · High CPU frequency (up to 3GHz)
            · Multicore/Multithread
            · Scalability: up to 64 sockets
          – SPARC64 VIIIfx (HPC processor) features
            · HPC-ACE: Innovative ISA extensions to SPARC-V9
            · High Memory B/W: peak 64GB/s, Embedded Memory Controller
        • Add new features vital to current and future UNIX servers
          – Virtual Machine Architecture
          – Software On Chip
          – Embedded IOC (PCI-GEN3 controller)
          – Direct CPU-CPU interconnect
  5. 5. Software on Chip 1/2
        • HW for SW: accelerates specific software functions with HW
        • The targets
          – Decimal operation (IEEE754 decimal and NUMBER)
          – Cypher operation (AES/DES)
          – Database acceleration
        • HW implementation
          – The HW engines for SWoC are implemented in the FPU
            · To fully utilize 128 FP registers & software pipelining
          – Implemented as instructions rather than a dedicated co-processor, to maximize flexibility of SW.
          – Avoid complication due to “CISC” type instructions
            · Various “RISC” type instructions are newly defined instead.
            · 18 insts. for Decimal, and 10 insts. for Cypher operation
  6. 6. Software on Chip 2/2: Decimal Instructions
        • Supported data types
          – IEEE754 DPD (Densely Packed Decimal): 8B fixed length
          – NUMBER: variable length (max 21 bytes)
        • Instructions
          – Both DPD/NUMBER instructions are defined as 8B operations (add/sub/mul/div/cmp) on FP registers
            · To maximize performance with reasonable HW cost
            · When the data length is > 8 bytes, multiple such instructions will be used.
          – An instruction for special byte-shift on FP registers is newly added to support unaligned NUMBER
        [Figure: byte-shift dataflow in which Fd[rs1] and Fd[rs2] pass through "and" masks and are combined into Fd[rd].]
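The statement on slide 6 that operands longer than 8 bytes are handled by "multiple such instructions" can be pictured with a small model. This is only a sketch under stated assumptions: the slide does not give the instruction semantics, the encodings, or how carries are passed between 8-byte operations. The chunk capacity of 16 decimal digits and the names `split_chunks`, `add_chunk`, and `add_long_decimal` are hypothetical and chosen for illustration.

```python
# Toy model, NOT the SPARC64 X instruction semantics.
# Assumption: one 8-byte operand carries 16 decimal digits.
DIGITS_PER_CHUNK = 16
CHUNK_BASE = 10 ** DIGITS_PER_CHUNK


def split_chunks(value):
    """Split a non-negative integer into little-endian decimal chunks."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    chunks = []
    while True:
        value, low = divmod(value, CHUNK_BASE)
        chunks.append(low)
        if value == 0:
            return chunks


def add_chunk(a, b, carry_in):
    """Stand-in for one fixed-width decimal add: (sum chunk, carry out)."""
    total = a + b + carry_in
    return total % CHUNK_BASE, total // CHUNK_BASE


def add_long_decimal(x, y):
    """Add two long decimals with one fixed-width add per chunk."""
    xs, ys = split_chunks(x), split_chunks(y)
    width = max(len(xs), len(ys))
    xs += [0] * (width - len(xs))
    ys += [0] * (width - len(ys))
    result, carry = [], 0
    for a, b in zip(xs, ys):
        chunk_sum, carry = add_chunk(a, b, carry)
        result.append(chunk_sum)
    if carry:
        result.append(carry)
    return result


def join_chunks(chunks):
    """Reassemble little-endian chunks into a single integer."""
    return sum(c * CHUNK_BASE ** i for i, c in enumerate(chunks))
```

The point of the sketch is the loop shape: a fixed-width operation is applied once per chunk and a carry links the chunks, so the per-operation hardware stays small while longer operands cost more instructions.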
  7. 7. SPARC64™ X Chip Overview
        • Architecture Features
          – 16 cores x 2 threads
          – SWoC (Software on Chip)
          – Shared 24 MB L2$
          – Embedded Memory and IO Controller
        • 28nm CMOS
          – 23.5mm x 25.0mm
          – 2,950M transistors
          – 1,500 signal pins
          – 3GHz
        • Performance (peak)
          – 288GIPS/382GFlops
          – 102GB/s memory throughput
        [Figure: chip floorplan labelled with 16 Cores, L2 Cache Data, L2 Cache Control, MAC, DDR3 Interface, and SERDES (PCI GEN3, Inter-CPU).]
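As a quick sanity check on the peak figures of slide 7, the chip-level numbers can be divided back down to a per-core, per-cycle rate. The sketch below uses only the values printed on the slide (16 cores, 3GHz, 288GIPS, 382GFlops); the helper name `per_core_per_cycle` is hypothetical.

```python
# Values as printed on the chip overview slide.
CORES = 16
FREQ_GHZ = 3.0
PEAK_GIPS = 288.0
PEAK_GFLOPS = 382.0


def per_core_per_cycle(peak_giga_ops, cores=CORES, freq_ghz=FREQ_GHZ):
    """Convert a chip-level peak rate into operations per core per cycle."""
    return peak_giga_ops / (cores * freq_ghz)


instructions_per_cycle = per_core_per_cycle(PEAK_GIPS)   # 6.0
flops_per_cycle = per_core_per_cycle(PEAK_GFLOPS)        # just under 8
```

The instruction figure works out to exactly 6 per core per cycle. The floating-point figure comes out slightly below 8; counting each of the four FMA units listed on the core spec slide as 2 flops per cycle would give 8 per core per cycle (384GFlops at exactly 3GHz), but that counting rule is an assumption here and the slides do not explain the small difference.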
  8. 8. SPARC64™ X Core spec
        • Instruction Set Architecture: SPARC-V9/JPS, HPC-ACE, VM, SWoC
        • Branch Prediction: 4K BRHIS, 16K PHT
        • Integer Execution Units
          – 156 GPR x 2 + 64 GUB
          – ALU/SHIFT x2
          – ALU/AGEN x2
          – MULT/DIVIDE x1
        • FP Execution Units
          – 128 FPR x 2 + 64 FUB
          – FMA x4, FDIV x2
          – IMA/Logic x4
          – Decimal x1 / Cypher x2
        • L1$
          – L1I$ 64KB/4way
          – L1D$ 64KB/4way
        [Figure: core floorplan labelled with L1$ control, L1D$, L1I$, Execution Unit, Instruction Control, and Register File.]
  9. 9. µ-Architecture enhancements from SPARC64™ VII+
        • CPU Core
          – Deeper pipeline to increase frequency
          – Better branch prediction scheme
          – Increased sizes of various queues and number of floating-point registers
          – Richer execution units, including
            · 2EX + 2EAG → 2EX + 2EX/EAG
            · 2FMA → 4FMA to support 2way-SIMD
            · SWoC engine (Decimal and Cypher)
          – More aggressive O-O-O execution of load and store
          – Multi-banked 2port L1-Cache
        • System On Chip
          – #core and L2$ size (4core/12MB → 16core/24MB)
          – Memory Controller, IO Controller, and CPU-CPU I/F are all embedded to increase performance and reduce cost.
  10. 10. SPARC64™ VII/VII+ Pipeline
        [Figure: pipeline block diagram. Stages: Fetch, Issue, Dispatch, Reg.-Read, Execute, Memory, Commit; stage counts as labelled: (4 stages), (2 stages), (4 stages), (L1$: 3 stages), (2 stages). Several blocks carry an x2 multiplier in the figure.]
        • Caches and branch prediction: L1 I$ 64KB 2Way, L1 D$ 64KB 2Way, Branch Target Address 8Kentry
        • Queues: Fetch Port 16Entry, Store Port 16Entry, Store Buffer 16Entry, CSE 64Entry
        • Reservation stations: RSA 10Entry, RSE 8x2Entry, RSF 8x2Entry, RSBR 10Entry
        • Registers: GPR 156Registers, GUB 32Registers, FPR 64Registers, FUB 48Registers, PC, Control Registers
        • Execution units: EAGA, EAGB, EXA, EXB, FLA, FLB
        • L2$ 6MB/12MB 12Way, shared by 4 cores; System Bus Interface
  11. 11. SPARC64™ X Pipeline
        [Figure: pipeline block diagram. Stages: Fetch, Issue, Dispatch, Reg.-Read, Execute, Memory, Commit; stage counts as labelled: (4 stages), (4 stages), (5 stages), (L1$: 3 stages), (2 stages). GPR, FPR, PC, and Control Registers are each drawn twice.]
        • Caches and branch prediction: L1 I$ 64KB 4Way, L1 D$ 64KB 4Way, Branch Target Address 4Kentry, Pattern History Table 16Kentry
        • Queues: Fetch Port 32Entry, Store Port 24Entry, Write Buffer 10Entry, CSE 96Entry
        • Reservation stations: RSA 24Entry, RSE 24Entry, RSF 20Entry, RSBR 16Entry
        • Registers: GPR 156Registers, GUB 64Registers, FPR 128Registers, FUB 64Registers, PC, Control Registers
        • Execution units: EAGA, EAGB, EXA, EXB, EXC, EXD, FLA, FLB, FLC, FLD, Decimal, Cypher x2
        • L2$ 24MB 24Way, shared by 16 cores
        • Memory Controller (DIMM), IO Controller (PCI-GEN3), Router (CPU-CPU I/F)
  12. 12. Execution units enhancements (Ex.)
        • Integer Execution Unit
          – 2EX + 2EAG → 2EX + 2EX/EAG
          – 2 → 4W GPR
          → 4 integer instructions can be executed per cycle (sustained)
        • Load Store Unit
          – Aggressive load/store O-O-O execution:
            · Execute load without waiting for preceding store address calculation.
          – Multi-banked 2port L1-cache to execute 2 loads or 1 load + 1 store in parallel
          – Doubled L1$ bus size
          – Doubled L1$ associativity (2 → 4way)
          → Increase L1-cache throughput and hit-rate
        [Figure: EXA, EXB, EXC, and EXD writing to GUB and GPR (Update, Commit); banked L1 cache with 16B load x2 and 16B store paths, 2R/1W.]
  13. 13. SPARC64™ X interconnects
        • SPARC64™ VII/VII+ interconnects (SPARC Enterprise M8000)
          – 4 CPUs require 8 additional LSIs to be connected with DIMM
          – DIMM i/f: 4.35GB/s (STREAM triad)
          [Figure: four CPUs, each connected through an SC and a MAC to DDR2 DIMMs.]
        • SPARC64™ X interconnects
          – No additional LSIs to be connected with DIMM
          – DIMM i/f: 65.6GB/s (STREAM triad)
          – CPU i/f: 14.5GB/s x 5 ports (peak)
            · 3 ports: glueless 4way CPU interconnect
            · 2 ports: > 4way CPU
          [Figure: four CPUs, each connected directly to DDR3 DIMMs (102GB/s) and to each other (14.5GB/s).]
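The two STREAM triad figures on slide 13 can be put side by side with a short calculation. The sketch uses only numbers printed on the slide (4.35GB/s, 65.6GB/s, and the 102GB/s label in the figure); the variable names are hypothetical.

```python
# Figures as printed on the interconnect slide.
VII_PLUS_TRIAD_GBS = 4.35   # SPARC64 VII/VII+ DIMM i/f, STREAM triad
X_TRIAD_GBS = 65.6          # SPARC64 X DIMM i/f, STREAM triad
X_PEAK_GBS = 102.0          # SPARC64 X peak memory throughput

# Generation-to-generation gain in measured triad bandwidth.
triad_speedup = X_TRIAD_GBS / VII_PLUS_TRIAD_GBS

# Fraction of the quoted peak that the triad measurement reaches.
triad_fraction_of_peak = X_TRIAD_GBS / X_PEAK_GBS
```

The ratio comes out at about 15.1, which agrees with the "15x memory throughput" claim on the performance slide, and the triad result is roughly 64% of the quoted peak.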
  14. 14. High Speed Transceivers (SerDes)
        • CPU-CPU glue-less communication links
          – 14.5Gb/s x 8 lanes bi-directional serial interface, 5 ports
          – Embedded equalizer circuit enables long distance signal transmission
          – Embedded adaptive control logic optimizes equalizer parameters automatically depending on the various system configurations
        • PCI Express ports
          – 8Gb/s x 8 lanes (Gen 3), 2 ports
        • Built-in SerDes provides peak 88.5GB/s x2 (up/down) total throughput
        [Figure: SerDes block with RX, TX, and PLL; 14.5Gb/s x 8 lanes.]
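The 88.5GB/s total on slide 14 can be reproduced from the lane rates, lane counts, and port counts listed on the same slide. The sketch assumes raw line rates with no encoding or protocol overhead, which appears to be what the peak figure uses since the sum matches; the helper name `port_group_gbytes` is hypothetical.

```python
BITS_PER_BYTE = 8


def port_group_gbytes(lane_gbits, lanes, ports):
    """Raw one-direction throughput in GB/s for a group of identical ports."""
    return lane_gbits * lanes * ports / BITS_PER_BYTE


# CPU-CPU links: 14.5Gb/s x 8 lanes, 5 ports.
cpu_links_gbytes = port_group_gbytes(14.5, 8, 5)    # 72.5

# PCI Express Gen 3: 8Gb/s x 8 lanes, 2 ports.
pcie_gbytes = port_group_gbytes(8.0, 8, 2)          # 16.0

# Per-direction total; the slide quotes this figure "x2 (up/down)".
total_gbytes = cpu_links_gbytes + pcie_gbytes       # 88.5
```

A single CPU-CPU port works out to 14.5GB/s per direction, matching the "14.5GB/s x 5 ports" figure on the interconnect slide.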
  15. 15. Reliability, Availability, Serviceability
        • SPARC64™ X RAS diagram (Units: error detection and correction scheme)
          – Cache (Tag): ECC; Duplicate & Parity
          – Cache (Data): ECC; Parity
          – Register: ECC (INT/FP); Parity (Others)
          – ALU: Parity/Residue
          – Cache dynamic degradation: Yes
          – HW Instruction Retry: Yes
          – History: Yes
        [Figure: chip diagram colour-coded as Green: 1bit error Correctable; Yellow: 1bit error Detectable; Gray: 1bit error harmless.]
        • New RAS features from SPARC64™ VII/VII+
          – Floating-point registers are ECC protected
          – #checkers increased to ~53,000 to identify a failure point more precisely
          → Guarantees Data Integrity
  16. 16. Hardware Instruction Retry
        • Instruction retry sequence
          (1) Error
          (2) Flush
          (3) Single step execution
          (4) Update of SW visible resources
          (5) Back to normal execution after the re-executed instruction gets committed without an error.
        • When an error is detected, the hardware re-executes the instruction automatically to remove the transient error by itself.
        [Figure: Fetch / Execute / Commit pipeline shown before and after the retry, with IBF, IWR, CSE, PC, RSA, RSE, RSF, RSBR, EAGA/B, EXA/B, FLA/B, ALU, GUB, FUB, GPR, FPR, and Memory; GPR, FPR, PC, and Memory are marked as SW visible resources.]
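The five-step retry sequence on slide 16 can be mimicked in software to show the control flow. This is a toy model under stated assumptions, not a description of the hardware: the names `TransientFault`, `run_with_retry`, and `execute` are hypothetical, and the `max_retries` limit is an assumption, since the slide does not say what happens if the error persists.

```python
class TransientFault(Exception):
    """Raised by the execute callback to model a detected error."""


def run_with_retry(instructions, execute, max_retries=3):
    """Run instructions in order, re-executing any that hit a fault.

    Software-visible state (the committed list) is updated only after an
    instruction completes without error, mirroring steps (4) and (5).
    """
    committed = []   # stands in for SW visible resources
    retry_log = []   # records which instruction index was retried
    for pc, inst in enumerate(instructions):
        attempt = 0
        while True:
            try:
                # Step (3): on a retry this models single-step execution.
                result = execute(pc, inst, attempt)
            except TransientFault:
                # Steps (1) and (2): error detected, in-flight work dropped.
                attempt += 1
                retry_log.append(("retry", pc))
                if attempt > max_retries:
                    raise   # assumed behaviour for a persistent error
                continue
            # Step (4): commit the result to SW visible state.
            committed.append(result)
            # Step (5): resume normal execution with the next instruction.
            break
    return committed, retry_log
```

A callback that fails once on one instruction and then succeeds produces the same committed results as a fault-free run, with one entry in the retry log.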
  17. 17. SPARC64™ X Performance @3GHz
        [Figure: bar chart of hardware measured results, relative to SPARC64™ VII+ @2.86GHz; bar labels 7x, 8x, 15x, and 98x (SWoC).]
        • SPARC64™ X realizes 7x INT/FP/JVM throughput and 15x memory throughput of SPARC64™ VII+
          – The INT/FP/JVM result is with un-tuned Compiler/JVM.
        • SWoC of SPARC64™ X results in max 98x throughput.
          – The NUMBER score is for scalar. Expect to be much better for vector data.
  18. 18. SPARC64™ X CPI (Cycles Per Instruction) Example
        [Figure: CPI comparison of SPARC64™ VII+ vs. SPARC64™ X, INT (single thread), hardware measured results, on an axis from Lower Performance to Higher Performance; contributions labelled: Shorter memory latency, Large L2$, Improved Branch prediction, 2EX+2EAG → 2EX+2EX/EAG, 2 → 4way L1$, 2 → 4W GPR, Increased throughput of L1$.]
        • The 4 integer execution units and the increased number of GPR (integer register) write ports improve overall performance.
        • Memory latency reduction, the large L2$, branch prediction, and the L1$ improvements also contribute substantially to the high performance.
  19. 19. Summary
        • SPARC64™ X is Fujitsu’s 10th SPARC processor, designed for Fujitsu’s next generation UNIX server.
        • SPARC64™ X integrates 16 cores + 24MB L2 cache with over 100GB/s (peak) memory B/W.
        • SPARC64™ X keeps strong RAS features.
        • The SPARC64™ X chip is up and running in the lab.
        • It has shown 7 times the throughput of SPARC64™ VII+ w/o compiler tuning.
        • SWoC is effective at accelerating specific software functions.
        • Fujitsu will continue to develop the SPARC64™ series.
  20. 20. Abbreviations
        • SPARC64™ X
          – IB: Instruction Buffer
          – RSA: Reservation Station for Address generation
          – RSE: Reservation Station for Execution
          – RSF: Reservation Station for Floating-point
          – RSBR: Reservation Station for Branch
          – GUB: General Update Buffer
          – FUB: Floating point Update Buffer
          – GPR: General Purpose Register
          – FPR: Floating Point Register
          – CSE: Commit Stack Entry