Ludden q3 2008_boston

608 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
608
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Ludden q3 2008_boston

  1. 1. IBM Power Systems SMT Verification of the POWER5 and POWER6 High-Performance Processors John Ludden Senior Technical Staff Member Hardware Verification IBM Systems & Technology Group © 2008 IBM Corporation
  2. 2. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Introduction to Simultaneous Multi-Threading (SMT) 1. What is a multi-threaded processor? • Essentially a processor core that executes multiple instruction streams simultaneously • Each thread appears to software as a “virtual” processor core 2. What are the advantages of SMT? • More efficient utilization of silicon real estate and power: small die size increase compared to adding another core • Increased system throughput by utilizing processor resources that would otherwise be idle 3. What are the disadvantages of SMT? • Increased complexity -> Makes verification state space MUCH larger • SMT verification much harder than SMP • Possibly degrades performance of some applications 2 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  3. 3. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Examples of SMT microprocessors 1. Video Game Systems • Sony Playstation 3: IBM CELL processor • Xbox 360: IBM Xenon processor 2. Personal Computers: • Intel Pentium 4 Hyper-Threading (HT) processors 3. Servers: • SUN UltraSparc Systems: T1 (4 threads) and T2 (8 threads) • HP Superdome Systems: Intel Itanium 2 • IBM Power Systems: POWER5 and POWER6 processors 3 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  4. 4. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Overview 1. Context : POWER5 vs. POWER6 Microarchitecture Comparison 2. Verification methodology: In the beginning… 3. The times they are a changing: SMT arrives in POWER5 4. POWER6: An in-order design should be simpler, but… 5. Future directions? 4 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  5. 5. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors IBM POWER systems Consistent predictable delivery 2007 2006 POWER6 2004 POWER5+ 2003 POWER5 2001 POWER4+ POWER4 5 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  6. 6. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER5 Chip POWER6 Chip High Freq High Freq Ultra Freq Ultra Freq POWER5 POWER5 POWER6 POWER6 SMT2 Core SMT2 Core SMT2 Core SMT2 Core ~2 MB L2 4 MB L2 4 MB L2 36 MB 32 MB 36 MB L3 L3 32 MB L3 L3 Controller Chip Controller Chip(s) SMP Interconnect Fabric SMP Interconnect Fabric Memory Memory Memory Controller Controller Controller Buffer Buffer Buffer Chips Chips Chips 6 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  7. 7. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER5 Pipeline Out-of-Order Processing Branch Redirects Instruction Fetch BR MP ISS RF EX WB Xfer IF IC LD/ST IF BP CP MP ISS RF EA DC Fmt WB Xfer CP D0 D0 D1 D2 D3 Xfer GD MP ISS RF EX WB Xfer FX Group Formation and MP ISS RF F6 F6 F6 FP Instruction Decode F6 F6 F6 WB Xfer Interrupts & Flushes Branch Prediction Dynamic Instruction Shared Selection Branch Return Target Execution Program Shared Issue History Stack Cache Units Counter Queues Tables LSU0 Alternate FXU0 Instruction LSU1 Buffer 0 Group Formation, Instruction FXU1 Group Store Instruction Decode, Cache Completion Queue Instruction Dispatch FPU0 Instruction Buffer 1 FPU1 Translation BXU Thread Data Data Priority Shared CRL Translation Cache Read Shared Write Shared Register Register Files Register Files Mappers L2 Cache Shared by two threads Resource used by thread 0 Resource used by thread 1 7 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  8. 8. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors High-end server: New POWER6 microprocessor Topology – Two cores on chip, a 2-way SMP – Core private L1s (64KB I, 64KB D) – Superscalar, SMT cores – Chip private 8 MB L2 cache – L3 32 MB off chip – Two-tier SMP fabric Technology – 65 nm SOI – 341 mm2 die size – 10 Layers of metal – 790 million transistors on chip – Frequency : 3.5, 4.2, 4.7, 5.0 GHz Custom & semi-custom design style – High frequency constraints 3.3 M Lines of VHDL 8 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  9. 9. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER6 core pipeline P1 BR/CR P2 RF IFAR FX P3 RF EX P4 IC0 IC1 ROT IB0 IB1 PD DISP RF AG DC0 DC1 FMT LOAD BHT Instruction dispatch pipeline BR/FX/Load pipeline BHT RF ISS ECC EX1 EX2 EX3 EX4 EX5 EX6 EX7 ECC Instruction fetch pipeline Floating Point Pipeline Check Point Recovery Pipeline Legend : Pre-decode stage Instruction Decode stage Write back stage Cache access stage FX result bypass Ifetch/Branch stage Instruction Dispatch/Issue stage Completion stage Load result bypass Delayed/Transmit stage Operand access/execution stage Check Point stage Float result bypass 9 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  10. 10. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER6 core POWER6 processor is ~2X frequency of POWER5 (4 – 5 GHz) POWER6 instruction pipeline depth equivalent to POWER5 – Minimize power – Scale performance with frequency Instruction Fetch Instruction Buffer/Decode Instruction Dispatch/Issue Data Fetch/Execute ~6ns/instr ~3ns/instr FXU Dependent execution Load Dependent execution POWER6 extends functionality of POWER5 core – 64K I cache, 64K D cache, 2 FXU, 2 Binary FPU, 1 branch execution unit – Two way SMT with 7 instruction dispatch from 2 threads (maximum of 5 instructions per thread) – Decimal Floating Point Unit – VMX Unit (PowerPC’s SIMD ISA) – Recovery Unit 10 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  11. 11. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Bullet-proof computing System reliability with recovery unit – Every measure possible taken to preserve application execution – Retry soft errors – Change hardware for hard errors Processor architected state check pointed Every 1 cycle ECC & Non-ECC protected circuitry checked Every cycle No error found Error found Processor restarts from last saved checkpoint No error found Soft error case Error found Processor workload moved to another CPU Hard error case 11 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  12. 12. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Overview 1. Context : POWER5 vs. POWER6 microarchitecture comparison 2. Verification methodology: In the beginning… 3. The times they are a changing: SMT arrives in POWER5 4. POWER6: An in-order design should be simpler, but… 5. Future directions? 12 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  13. 13. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER4/5/6 RTL verification technology RTL PSL et al. Driver/Checker (VHDL, Verilog) Assertions Physical VLSI Language Compile Design Tools / Model Build Custom Design Test Program Generator (GPRO, X-Gen) Cycle-based Model Constraint C++ Random Testbench Unit Testbench Formal Software Simulator Verification: Boolean (Semi) Formal (MESA) Equivalence Verification Hardware Check (SixthSense, RuleBase) Accelerator (Verity) (Awan) 13 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  14. 14. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Single threaded uniprocessor verification for POWER4 Unit level: methodology inherited from POWER4 – Driven by a combination of instruction level test cases (AVPs) created by Genesys- Pro (GPRO) pseudo-random test generator and random C++ driven irritation – Instruction-By-Instruction (IBI) checking against AVP results – Low level microarchitecture checkers written in C++ Processor core (aka “core”) level – Mixture of GPRO pseudo-random and directed random instruction level test cases – IBI checking against AVP results – Low level microarchitecture checkers written in C++ - Irritation from random C++ drivers - Highly deterministic and architected state easily verifiable against test 14 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  15. 15. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Symmetric multi-processor (SMP) verification for POWER4 Chip (dual-core) level – Test generation similar to uniprocessor via GPRO for false-sharing or non-sharing tests • IBI checking against AVP results for two-independent instruction streams contained within single test • Low level microarchitecture checkers written in C++ • L1/L2 interactions primary focus – True-sharing scenarios, lock testing and storage access (“weak”) ordering checked • GPRO employed but…. – IBI checking of these accesses is limited or not possible: › Non-unique or non-deterministic results › CML (architecture level coherency monitor) employed to detect the “right answer” as a post-simulation rule check 15 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  16. 16. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Overview 1. Context : POWER5 vs. POWER6 microarchitecture comparison 2. Verification methodology: In the beginning… 3. The times they are a changing: SMT arrives in POWER5 4. POWER6: An in-order design should be simpler, but… 5. Future directions? 16 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  17. 17. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER5 SMT verification methodology Evolutionary based on single thread uniprocessor and SMP approaches – Traditional SMP scenarios now self-contained in a single core simulation model • Downward migration of dual-core methodology to single core model New SMT verification scenario categories – Shared resource and priority conflicts: • SMT resource types: – Equally shared between threads: Queue full conditions easier to hit – Dynamically shared / tagged: Either thread can consume most/all of the resource – Replicated: Not shared…same as single thread – Dynamic thread mode switching: SMT->ST; ST->SMT • Some applications attain better performance in ST mode • Shared resources re-allocated on each mode switch 17 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  18. 18. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Traditional SMP approach applied to SMT verification SMP.def SMP.def Test Test (test template) (test template) Generation Generation Output test case SMT.tst Core Level Registers common to both threads Core Level Registers common to both threads t0 Registers t1 Registers Random t0 Random t0 Random t1 Random t1 Real memory is common to both threads with test generator managing some potential overlap 18 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  19. 19. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Shared resource and priority conflicts Approach was similar to SMP verification – Testing largely consisted of “symmetric” instruction streams on each thread • A particular resource targeted (e.g., GPR rename registers) – 100 load instructions on each thread – Coverage and lab feedback validated this approach • Good enough: “Got the job done” 19 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  20. 20. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER5 dynamic thread mode switching Thread 0 Thread 1 All architected states initialized All architected states initialized Initial Thread enabled Thread enabled State Random instructions Random instructions Thread kills Save architected itself state Restart thread 0 Thread 0 terminates read Oth er th itself Shared resources reallocated Random instructions Wake up thread Interrupt Partition resources Sim Driver Restore architected Run state State Normal finish Normal finish Final Thread enabled Thread enabled State 20 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  21. 21. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER5 shared resource re-allocation on mode switch Rename Registers per Load Miss Queue entries thread per thread 200 SMT Mode 10 100 Max 5 SMT Mode 0 ST Mode 0 ST Mode GPR FPR Split in half Branch Queue (BIQ) Max LRQ/SRQ entries per entries per thread thread 20 40 SMT mode 20 10 SMT Mode 0 Max 0 ST Mode Dynamically ST mode Split in half Shared 21 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  22. 22. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Overview 1. Context : POWER5 vs. POWER6 microarchitecture comparison 2. Verification methodology: In the beginning… 3. The times they are a changing: SMT arrives in POWER5 4. POWER6: An in-order design should be simpler, but… 5. Future directions? 22 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  23. 23. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER5: centralized complexity POWER5 – Out-of-order design: Even in single thread mode, IFU complex events naturally occur simultaneously – Started from POWER4+: Known working design that was modified incrementally FXU ISU LSU – 23 FO4 design: Isolated complexity in Instruction Sequencing Unit (ISU): • Every unit communicated back to ISU • ISU resolved all exceptions and out-of-order conflicts FPU – ST and SMT modes both supported: • Alternating dispatch cycles per thread • Resources re-allocated on mode switch 23 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  24. 24. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER6 distributed complexity POWER6 IFU – From-scratch mostly in-order design • Normally, design is well behaved FXU IDU • Cross-thread interaction necessary for “tough bugs” – 13 FO4 design: Distributed complexity needed to achieve high performance goals – Recovery unit (RU): • Must resolve out-of-order FP with in-order pipelines • Checkpoints machine state RU FPU • Recovers processor from soft errors – Design is inherently in SMT mode all the time (almost) LSU • Dispatch to both threads in same cycle • Most resources dynamically shared / tagged • No resource reallocation on mode switch 24 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  25. 25. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors POWER6 verification process The different verification engines have different strengths related to the verification tasks Software simulation – Slow, but low penalty for highly intrusive checking of model internals. Total model visibility. – Hundreds of AIX workstations running 24x7x365 – New enhancements helped keep pace with design complexity – 2x number of simulation cycles of POWER5 design Hardware-accelerated simulation – 10-1k x Faster than SW sim, but need less intrusive driving/checking to not slow down hardware box. – New usage: Mainline function verification – Yields additional 3x simulation cycle advantage over POWER5 (5x cycle advantage overall) (Semi)-formal verification – (High to) Exhaustive coverage, but higher skill needed to drive. Scaling problems w/ model size. – Extensively used: Proved extremely valuable for complex SMT bugs Hardware bring-up – Ideal speed, very limited visibility/controllability 25 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  26. 26. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Software simulation enhancements Random command driven unit simulation for most core units – Yielded >1 Million lines of C++ code – More control over generation for low level events – More efficient test generation Irritator threads at “core model” level – “Symmetric” instruction stream approach employed on POWER5 proved inadequate “S” in SMT is for “Simultaneous”, not “Symmetric” – Target cross-thread interactions at the microarchitecture level – ~2x test generation efficiency – Ensures both threads running the same length (self adjusting) 26 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  27. 27. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Irritator thread example SMT_Irritator.def Test (test template) Generation Output test case SMT_Irritator.tst Core Level Registers common to both threads t0 Registers t1 Registers Irritator thread restrictions • Cannot cause unexpected exceptions • Cannot modify memory read Long Short by random thread Random t0 Irritator t1 • Cannot modify registers shared with other threads • Architected results may be undefined Real memory with test generator managing some potential overlap 27 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  28. 28. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Irritator thread example Long Random Thread Irritator Thread SEQUENCE SEQUENCE REPEAT 100 SELECT Kill Irritator Thre ad LB0: fdiv Group_All A: b to LB0 stw nop, A Generated Instr: 101 Generated Instr: 2 Simulated Instr: 101 Simulated Instr: Infinite 28 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  29. 29. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Simulation acceleration usage on POWER6 Extensively used on POWER6 – Run lab exercisers prior to tape-out • Found additional bugs missed by software simulation • Debug new exerciser functionality prior to lab • Error injection and recovery testing • Reproducibility of lab bugs in “simulation-like” environment for rapid debug of root cause • Rapid testing of bug fixes and collateral damage testing – Linux boot prior to tape-out – Not employed on POWER5 for “mainline” functional verification 29 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  30. 30. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors (Semi) Formal methods Formal methods are a vital complement to simulation flow – Lab bring-up bug re-creation • Often faster reproduction than simulation based approaches • Aids in root cause analysis • High-coverage / proof of side-effect-free fixes 30 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  31. 31. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Biggest challenge on POWER6 Error detection and soft error recovery – Why so hard? • Myriads of injection points coupled with large SMT state space – Often needed multiple “rare” combinations of “asymmetric” events on both threads while specific error was injected • End-to-end recovery testing difficult at unit level – Really a “core” effort – Verification strategy: – Error injection and recovery on hardware accelerated simulation platform – Dynamic on-the-fly error injection combined with “irritator threads” needed to cover large SMT recovery state space 31 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  32. 32. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Summary 1. SMT verification has four key pieces – Traditional SMP-like effort – Thread starvation and priority – Starting and stopping threads – Asymmetric “irritator thread” approach to verify often unforeseen cross-thread interactions at the microarchitecture level 2. “From-scratch in-order” SMT design was more difficult to verify than the “out-of-order retrofitted” SMT design – Complex events only occurred due to cross thread interaction – Even though team had experience – Required more “weapons” in the arsenal 3. High frequency design drove distributed complexity – Makes verification job harder – Increased dependency on formal verification for difficult bugs 4. “Mainframe”-like RAS on POWER6 drove a huge amount of work that was difficult to attack at the unit level 32 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  33. 33. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Overview 1. Context : POWER5 vs. POWER6 microarchitecture comparison 2. Verification methodology: In the beginning… 3. The times they are a changing: SMT arrives in POWER5 4. POWER6: An in-order design should be simpler, but… 5. Future directions? 33 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
  34. 34. IBM System p SMT Verification of the POWER5 and POWER6 High-Performance Processors Future directions Predictions – RAS features will be an increasingly important feature of server systems • POWER6 design has set the “bar” to a new high standard to which future processors will have to measure up - Power Systems Revenue up 29% in 2Q08 (from 2Q07) • Verification methods employed on POWER6 to attack nearly infinite state space created by the combination of SMT and processor recovery features will become standard practice – A migration of “pre-silicon” verification techniques into “post-silicon” hardware lab verification effort • Hardware is the fastest “simulator” available and the state space is getting bigger with SMT 34 IBM Systems © 2006 IBM Corporation IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation

×