
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures


From the ISSCC 2018 presentation by Noah Beck, Sean White, Milam Paraschou and Samuel Naffziger.


  1. “Zeppelin”: an SoC for Multi-chip Architectures
     Noah Beck¹, Sean White¹, Milam Paraschou², Samuel Naffziger²
     ¹AMD, Boxborough   ²AMD, Fort Collins
     Paper 2.4, presented at ISSCC 2018
     © 2018 IEEE International Solid-State Circuits Conference
  2. Outline
     ▪ Design goals for the System-on-a-Chip codenamed “Zeppelin”
     ▪ SoC architecture
     ▪ Core complex codenamed “Zen”
     ▪ AMD Infinity Fabric (IF)
     ▪ I/O capabilities, I/O muxing
     ▪ Floorplan and packaging
     ▪ Results
  3. “Zeppelin” SoC Goals
     Design a System-on-a-Chip solution for scalability across the Server market:
     ▪ 4-die multi-chip module (MCM) for Server in new infrastructure
     ▪ Same SoC suitable for High-End Desktop
       – 1-die Desktop in existing AM4 infrastructure
       – 2-die MCM High-End Desktop in new infrastructure
  4. “Zeppelin” Die Functional Overview
     ▪ Compute
       – 8 “Zen” x86 cores
       – 4MB total L2 cache
       – 16MB total L3 cache
     ▪ Memory
       – 2-channel DDR4 with ECC
       – 2 DIMMs/channel and up to 256GB/channel
     ▪ Integrated I/O
       – Coherent and control Infinity Fabric links
       – 32 lanes of high-speed SerDes
       – 4 USB 3.1 Gen1 ports
       – Server Controller Hub (SPI, LPC, UART, I2C, RTC, SMBus)
     [Figure: die block diagram — two 4-core “Zen” clusters each with an L3, two DDR channels, four IFOP links, an IFIS/PCIe link, and an IFIS/PCIe/SATA link]
  5. Chip Architecture
     ▪ The Infinity Fabric Scalable Data Fabric (SDF) plane connects the on-die blocks
     [Figure: block diagram — two CCXs (4 cores + L3) attached through CCMs, two UMCs driving DDR, an IOMS fronting the IO Complex (PCIe, SATA, Southbridge), CAKEs bridging to the IFOP and IFIS/PCIe links, and the SMU on the IF SCF plane]
  6. CCX: CPU Complex
     ▪ 4 cores with L1/L2 caches, plus a shared L3 cache
     ▪ “Zen” core described in [Singh ISSCC17]
       – L1 Instruction Cache: 64KB, 4-way associative
       – L1 Data Cache: 32KB, 8-way associative
       – L2 Cache: 512KB, 8-way associative
       – 2 threads per core
     ▪ L3 cache: 8MB, 16-way associative, shared by all four cores
     [Figure: CCX floorplan — four cores, per-core 512KB L2 macros with L2 control, and 1MB L3 macros with L3 control]
  7. “Zen” Cache Hierarchy
     ▪ Fast private L2 cache: 12 cycles
     ▪ Fast shared L3 cache: 35 cycles
     ▪ L3 filled from L2 victims of all four cores
     ▪ L2 tags duplicated in L3 for probe filtering and fast cache transfer
     ▪ Multiple smart prefetchers
     ▪ 50 outstanding misses from L2 to L3 per core
     ▪ 96 outstanding misses from L3 to memory
     [Figure: per-core datapaths — 32B/cycle fetch from the 64KB 4-way I-cache, 2×16B loads and 1×16B store to the 32KB 8-way D-cache, 32B/cycle links between the 512KB 8-way L2 and the 8MB 16-way L3]
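The quoted latencies can be folded into a back-of-the-envelope average memory access time. This is a minimal sketch: the 12-cycle L2, 35-cycle L3, and ~90ns memory figures come from the presentation, while the L1 latency and all hit rates are illustrative assumptions.

```python
# Back-of-the-envelope AMAT for the "Zen" cache hierarchy.
L1_LAT = 4       # cycles (assumed; not stated on the slide)
L2_LAT = 12      # cycles (from the slide)
L3_LAT = 35      # cycles (from the slide)
MEM_LAT = 216    # cycles: ~90ns local-memory latency at the 2.4GHz clock quoted in the Endnotes

def amat(l1_hit, l2_hit, l3_hit):
    """Average memory access time in core cycles for given per-level hit rates."""
    l3_term = L3_LAT + (1 - l3_hit) * MEM_LAT   # cost once an access reaches L3
    l2_term = L2_LAT + (1 - l2_hit) * l3_term   # cost once an access reaches L2
    return L1_LAT + (1 - l1_hit) * l2_term

# Illustrative hit rates: 95% L1, 80% L2, 50% L3.
print(f"{amat(0.95, 0.80, 0.50):.2f} cycles")
```

Even with half of L3 lookups missing all the way to DRAM, the fast L2/L3 keep the average access in the single-digit-cycle range under these assumed hit rates.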
  8. AMD Infinity Fabric: Scalable Data Fabric
     The SDF Transport Layer connects the on-die masters and slaves:
     ▪ CCM (Cache-Coherent Master): interface for a CCX (Core Complex)
     ▪ UMC (Unified Memory Controller): one per DDR4 channel
     ▪ IOMS (I/O Master/Slave): interface for the IO Complex
     ▪ CAKE (Coherent AMD SocKet Extender): bridges the SDF to off-chip links
     ▪ IFIS (IF Inter-Socket SerDes) and IFOP (IF On-Package SerDes): the off-chip links
  9. SDF Local Memory Access
     ▪ Latency from a core to local DDR4 memory on the same die: ~90ns*
     [Figure: SDF block diagram highlighting the CCX → CCM → SDF Transport Layer → UMC → DDR4 path]
     * See Endnotes for additional system configuration details
  10. SDF Die-to-Die Memory Accesses
      ▪ Latency to memory on another die within the socket (via CAKE and IFOP): ~145ns*
      ▪ Latency to memory attached to the other socket, single hop (via CAKE and IFIS): ~200ns*
      [Figure: SDF paths through IFOP to a same-package die and through IFIS to an other-socket die]
      * See Endnotes for additional system configuration details
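The three hop costs above imply an average latency for memory spread across the package. A sketch assuming accesses interleave uniformly over all dies (the access pattern is an assumption; the per-hop latencies are from the slides):

```python
# Expected memory latency under uniform interleaving across dies.
LOCAL_NS = 90          # same die (from the slides)
INTRA_SOCKET_NS = 145  # other die, same package, via IFOP (from the slides)
INTER_SOCKET_NS = 200  # other socket, single hop, via IFIS (from the slides)

def expected_latency_ns(n_sockets):
    """Average latency if accesses spread evenly over all dies (4 dies/socket)."""
    dies = 4 * n_sockets
    local = 1 / dies               # fraction of accesses landing on the local die
    intra = 3 / dies               # the other three dies in the local socket
    inter = 1 - local - intra      # all dies in the remote socket
    return local * LOCAL_NS + intra * INTRA_SOCKET_NS + inter * INTER_SOCKET_NS

print(expected_latency_ns(1))  # 1-socket, 4-die package -> 131.25
print(expected_latency_ns(2))  # 2-socket, 8 dies total  -> 165.625
```

Uniform interleaving is close to the worst reasonable case; NUMA-aware placement pushes the average back toward the ~90ns local figure.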
  11. 2pJ/bit IFOP SerDes
      ▪ Low-swing, single-ended data for ~50% of the power of an equivalent differential driver
      ▪ Zero-power driver state during logic-0 transmit
        – Transmit/receive impedance termination to ground while the driver pullup is disabled
        – Also applied during link idle
      ▪ Data bit inversion encoding saves 10% average power per bit
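With a driver that burns power only while transmitting a logic 1, data bit inversion pays off by inverting any word that has more 1s than 0s. The slide does not specify the exact encoding, so this is a sketch of the common "minimize ones" variant over a hypothetical 8-bit lane group:

```python
# Data bit inversion (DBI) sketch for an assumed 8-bit lane group.
def dbi_encode(byte):
    """Return (encoded_byte, invert_flag) so the encoded byte has at most four 1 bits."""
    if bin(byte).count("1") > 4:
        return byte ^ 0xFF, 1   # invert: majority-1 bytes become majority-0
    return byte, 0

def dbi_decode(encoded, invert_flag):
    """Undo the encoding using the transmitted invert flag."""
    return encoded ^ 0xFF if invert_flag else encoded

# Every encoded byte drives at most 4 of the 8 data lanes high...
assert all(bin(dbi_encode(b)[0]).count("1") <= 4 for b in range(256))
# ...and decoding always recovers the original data.
assert all(dbi_decode(*dbi_encode(b)) == b for b in range(256))
```

The invert flag itself consumes a lane and sometimes carries a 1, which eats into the savings; that is consistent with an average figure like the quoted 10% rather than the larger bound on the data bits alone.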
  12. Hierarchical Power Management
      ▪ The System Management Unit (SMU) uses the IF Scalable Control Fabric (SCF) plane
      ▪ SCF: single-lane IFIS SerDes link for chip-to-chip or socket-to-socket communication
      ▪ SMU calculation hierarchy for voltage-level control, C-state boost, thermal management, and electrical design current management
        – Local chip SMU: fast loop
        – Master chip SMU: slower loop
      [Figure: four dies, each with an SMU; one die hosts the master SMU, with an SCF link to the other socket]
  13. IO Subsystem & Muxing
      ▪ 32 lanes of multi-protocol I/O
        – PCIe, IFIS: two 16-lane links
        – PCIe link bifurcation: max 8 devices per 16-lane link
        – SATA: 8 lanes of the bottom link
      ▪ Supports multiple market segments
      ▪ Muxing support adds <1 channel clock of latency to an IFIS 16-lane link
      [Figure: bifurcation tree for each 16-lane link — x16 splitting down through x8, x4, and x2 to x1; SATA available on the bottom link; lanes placed along the I/O edges of the die]
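Bifurcation of a 16-lane link can be modeled as recursively halving links and capping the endpoint count. This enumeration is a sketch of that model; equal-halves splitting is an assumption, since the slide only states the lane widths and the 8-device cap.

```python
# Enumerate bifurcation configurations of one 16-lane link, modeling each
# link of width >= 2 as either used whole or split into two equal halves.
from functools import lru_cache

@lru_cache(maxsize=None)
def configs(width):
    """All possible sorted tuples of endpoint link widths for a given link."""
    result = {(width,)}                  # use the link whole
    if width >= 2:
        halves = configs(width // 2)     # split into two equal halves
        for a in halves:
            for b in halves:
                result.add(tuple(sorted(a + b)))
    return frozenset(result)

# Apply the slide's cap of 8 devices per 16-lane link.
valid = [c for c in configs(16) if len(c) <= 8]
for c in sorted(valid, key=len):
    print(c)
```

Every configuration uses all 16 lanes; the 8-device cap is what rules out splitting all the way down to sixteen x1 links.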
  14. Chip Floorplanning for Package
      ▪ DDR placement on one die edge
      ▪ Chips in the 4-die MCM are rotated 180°, with DDR facing the package left/right edges
      ▪ Package-top Infinity Fabric pinout requires diagonal placement of IFIS
      ▪ A 4th IFOP enables routing of high-speed I/O in only four package substrate layers
      [Figure: die floorplan — I/O along the top and bottom edges, two CCXs in the middle, DDR on one side edge]
  15. DDR+IFOP Package Routing
      ▪ Vertical and horizontal IFOP: 2 layers each
      ▪ Diagonal IFOP: 1 layer each
      ▪ DDR channel: 1 layer each
      [Figure: substrate layers A and B — IFOP and DDR routes among the four rotated dies]
  16. DDR+IFIS Package Routing
      ▪ DDR channel: 1 layer each
      ▪ IFIS links: 2 layers each
      [Figure: substrate layers C and D — IFIS and DDR routes among the four rotated dies]
  17. MCM Versus Single-Chip Design
      ▪ 4-die MCM package: 852mm² of silicon (4 × 213mm²)
      ▪ Large single-chip design:
        – ~10% area savings: 777mm² (near the reticle size limit)
        – Manufacturing/test cost: ~40% higher
        – Full 32-core yield: ~17% lower
        – Full 32-core cost: ~70% higher
      ▪ High-yielding multi-chip assembly process
        – Achievable based on internal production data
        – Die frequency matching using on-die frequency sensors
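The yield/cost argument follows from any standard defect model. Below is a sketch using Poisson yield, Y = exp(−D·A), with an assumed defect density; AMD's ~40%/~70% figures come from internal production data, which this does not reproduce exactly.

```python
# Poisson yield model: why four small dies beat one big die on cost.
import math

D = 0.001                        # defects per mm^2 (illustrative assumption)
A_SMALL, A_BIG = 213.0, 777.0    # die areas in mm^2 (from the slide)

def cost_per_good_die(area):
    """Silicon area consumed per good die: dies are discarded at yield rate 1 - Y."""
    yield_rate = math.exp(-D * area)
    return area / yield_rate

mcm_cost = 4 * cost_per_good_die(A_SMALL)   # four known-good dies per package
mono_cost = cost_per_good_die(A_BIG)        # one die that must be fully defect-free

print(f"monolithic / MCM silicon cost ratio: {mono_cost / mcm_cost:.2f}")
```

The key asymmetry is that MCM dies are tested before assembly, so only good dies are packaged, while the monolithic die must be defect-free across all 777mm² at once.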
  18. MCM Package Achievements
      ▪ 4094 total LGA pins
      ▪ 58mm × 75mm organic substrate
      ▪ 534 IF high-speed chip-to-chip nets
        – Over 256GB/s total in-package bandwidth
      ▪ 1760 high-speed pins
        – Over 450GB/s total off-package bandwidth
  19. More MCM Package Achievements
      ▪ ~300µF of on-package capacitance
      ▪ ~300A of current
      ▪ Up to 200W TDP
      [Figure: package pin map — core supply pins (180A), uncore supply pins (65A), and two groups of 1.2V supply pins (30A each)]
  20. MCM Core Voltage Variation
      ▪ Per-core measurements shown
        – ±25mV accuracy with a max-power workload
      ▪ Per-core ring oscillators
        – Calibrated for temperature and voltage
        – Min/max voltage sampled 470M/s
      ▪ Static differences compensated by per-core LDOs
      ▪ Dynamic differences mitigated by clock stretching and DPM states
      [Figure: measured per-core voltage deltas in mV across the four dies]
      * See Endnotes for additional system configuration details
  21. Core Voltage Measurements
      ▪ Measured data shows excellent tracking of per-core voltage from the digital LDO, with mV-accurate target voltage
      ▪ Power savings through per-core voltage optimization
      * See Endnotes for additional system configuration details
  22. 4-Chip EPYC Package
      ▪ 128 lanes can be used as PCIe
        – Attach six 16-lane accelerator cards to a single socket
      ▪ 8 DDR4 channels
      [Figure: single-socket AMD EPYC™ system — four dies, 64 lanes of high-speed I/O on each side, 4 DDR4 channels per side, plus NIC, 16 DIMMs, and 8 drives]
  23. Dual 4-Chip EPYC Packages
      [Figure: dual-socket AMD EPYC™ system — two 4-die packages joined by 128 lanes of high-speed I/O, each socket retaining 4+4 DDR4 channels]
  24. Single-Chip AM4 Package
      ▪ Socket-compatible with other AMD SoCs for the desktop market
      ▪ 8 cores / 16 threads
      ▪ 2 DDR4 channels
      ▪ 24 PCIe Gen3 lanes
      ▪ Up to 95W TDP
      [Figure: AMD Ryzen™ system — one die, 24 lanes of high-speed I/O, 2 DDR4 channels]
  25. 2-Chip sTR4 Package
      ▪ Socket defined for the “Zeppelin” SoC and compatible with future designs
      ▪ 16 cores / 32 threads
      ▪ 4 DDR4 channels
      ▪ 64 PCIe Gen3 lanes
      [Figure: AMD Ryzen™ Threadripper™ system — two dies, each with 32 lanes of high-speed I/O and 2 DDR4 channels]
  26. Benchmark Results
      ▪ Scalable performance from a single chip up to an 8-chip, 2-socket configuration
      * See Endnotes for additional system configuration details
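The per-configuration scores quoted in the Endnotes allow a rough scaling check. Dividing each score by a linear extrapolation of the single-die score gives a per-die scaling efficiency; the parts differ in TDP and clocks, so the numbers are only indicative.

```python
# Scaling across "Zeppelin" die counts, using the estimated scores from
# the Endnotes: Ryzen 7 1800X (1 die), Threadripper 1950X (2 dies),
# EPYC 7601 1-socket (4 dies) and 2-socket (8 dies).
scores = {1: 211, 2: 375, 4: 702, 8: 1390}

for dies, score in scores.items():
    efficiency = score / (dies * scores[1])   # fraction of ideal linear scaling
    print(f"{dies} die(s): score {score}, scaling efficiency {efficiency:.0%}")
```

Efficiency stays above 80% of linear all the way to eight dies, which is the "scalable performance" claim of the slide in numeric form.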
  27. An SoC for Multi-chip Architectures
      ▪ Performance Server
      ▪ High-End Desktop
      ▪ Mainstream Desktop
      [Figure: the “Zeppelin” die block diagram, packaged as a 4-die Server MCM, a 2-die High-End Desktop MCM with two dummy dies, and a 1-die Mainstream Desktop part]
  28. Acknowledgment
      ▪ We would like to thank our talented AMD design teams across Austin, Bangalore, Boston, Fort Collins, Hyderabad, Markham, Santa Clara, and Shanghai, who contributed to “Zen” and “Zeppelin”
      ▪ Please check out our demo tonight
  29. Endnotes
      AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
      Slides 9, 10: Latencies assume a 2.4GHz CPU core frequency and 1R DDR4-2667 19-19-19 RDIMMs. Memory, IFIS, and IFOP latencies depend on the DRAM clock; memory latencies include testing overhead (including DRAM refresh).
      Slides 20, 21: Power measurements were taken from an SP3 Diesel non-DAP AMD evaluation system with EPYC rev B1 parts, BIOS revision WDL7405N, and Windows Server 2016, running a Max Power pattern at 2.5GHz core frequency.
      Slide 26: AMD Ryzen™ 7 1800X CPU scored 211, using estimated scores based on testing performed in AMD internal labs as of 30 March 2017. System config: AMD Myrtle-SM with 95W R7 1800X, 32GB DDR4-2667 RAM, Crucial CT256M550 SSD, Ubuntu 15.10, GCC -O2 v4.6 compiler suite.
      AMD Ryzen™ Threadripper™ 1950X CPU scored 375, using estimated scores based on testing performed in AMD internal labs as of 7 September 2017. System config: AMD Whitehaven-DAP with 180W TR 1950X, 64GB DDR4-2667 RAM, CT256M4 SSD, Ubuntu 15.10, GCC -O2 v4.6 compiler suite.
      AMD EPYC™ 7601 CPU scored 702 in a 1-socket system, using estimated scores based on internal AMD testing as of 6 June 2017. System config: 1 × EPYC™ 7601 CPU in an HPE Cloudline CL3150, Ubuntu 16.04, GCC -O2 v6.3 compiler suite, 256GB (8 × 32GB 2Rx4 PC4-2666) memory, 1 × 500GB SSD.
      AMD EPYC™ 7601 scored 1390 in a 2-socket system, using estimated scores based on internal AMD testing as of 6 June 2017. System config: 2 × EPYC™ 7601 CPUs in a Supermicro AS-1123US-TR4, Ubuntu 16.04, GCC -O2 v6.3 compiler suite, 512GB (16 × 32GB 2Rx4 PC4-2666 running at 2400) memory, 1 × 500GB SSD.
