SMT Verification of the POWER5 and POWER6 High-Performance Processors

1,100 views

Published on

Published in: Technology, Business
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,100
On SlideShare
0
From Embeds
0
Number of Embeds
10
Actions
Shares
0
Downloads
26
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

SMT Verification of the POWER5 and POWER6 High-Performance Processors

  1. 1. IBM Power Systems© 2008 IBM CorporationSMT Verification of the POWER5 and POWER6High-Performance ProcessorsJohn LuddenSenior Technical Staff MemberHardware VerificationIBM Systems & Technology Group
  2. 2. IBM System p2 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance Processors1. What is a multi-threaded processor?• Essentially a processor core that executes multipleinstruction streams simultaneously• Each thread appears to software as a “virtual” processor core2. What are the advantages of SMT?• More efficient utilization of silicon real estate and power: smalldie size increase compared to adding another core• Increased system throughput by utilizing processor resourcesthat would otherwise be idle3. What are the disadvantages of SMT?• Increased complexity -> Makes verification state space MUCHlarger• SMT verification much harder than SMP• Possibly degrades performance of some applicationsIntroduction to Simultaneous Multi-Threading(SMT)
  3. 3. IBM System p3 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance Processors1. Video Game Systems• Sony Playstation 3: IBM CELL processor• Xbox 360: IBM Xenon processor2. Personal Computers:• Intel Pentium 4 Hyper-Threading (HT) processors3. Servers:• SUN UltraSparc Systems: T1 (4 threads) and T2 (8 threads)• HP Superdome Systems: Intel Itanium 2• IBM Power Systems: POWER5 and POWER6 processorsExamples of SMT microprocessors
  4. 4. IBM System p4 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance Processors1. Context : POWER5 vs. POWER6 Microarchitecture Comparison2. Verification methodology: In the beginning…3. The times they are a changing: SMT arrives in POWER54. POWER6: An in-order design should be simpler, but…5. Future directions?Overview
  5. 5. IBM System p5 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsConsistent predictable deliveryIBM POWER systemsPOWER4+POWER4POWER5POWER5+POWER620012003200420062007
  6. 6. IBM System p6 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsPOWER5 ChipHigh FreqPOWER5SMT2 Core~2 MB L236 MB L3Controller36 MBL3ChipSMP Interconnect FabricMemoryControllerBufferChipsHigh FreqPOWER5SMT2 CorePOWER6 ChipUltra FreqPOWER6SMT2 Core4 MB L232 MB L3Controller32 MBL3Chip(s)SMP Interconnect FabricUltra FreqPOWER6SMT2 Core4 MB L2MemoryControllerMemoryControllerBufferChipsBufferChips
  7. 7. IBM System p7 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsPOWER5 PipelineMP ISS RF EA DC WB XferMP ISS RF EX WB XferMP ISS RF EX WB XferMP ISS RF F6XferF6F6F6F6F6CPBRLD/STFXFPGroup Formation andInstruction DecodeInstruction FetchBranch RedirectsInterrupts & FlushesOut-of-Order ProcessingWBFmtD1 D2 D3 Xfer GDD0D0Shared by two threads Resource used by thread 1Resource used by thread 0Shared IssueQueuesCPLSU0FXU0LSU1FXU1FPU0FPU1BXUCRLSharedExecutionUnitsRead SharedRegister FilesDynamicInstructionSelectionThreadPriorityGroup Formation,Instruction Decode,DispatchSharedRegisterMappersAlternateTargetCacheBranch PredictionInstructionTranslationInstructionCacheProgramCounterBranchHistoryTablesReturnStackInstructionBuffer 1InstructionBuffer 0Write SharedRegister FilesGroupCompletionStoreQueueDataCacheDataTranslationL2CacheIF BPICIF
  8. 8. IBM System p8 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsHigh-end server: New POWER6 microprocessorTopology– Two cores on chip, a 2-way SMP– Core private L1s (64KB I, 64KB D)– Superscalar, SMT cores– Chip private 8 MB L2 cache– L3 32 MB off chip– Two-tier SMP fabricTechnology– 65 nm SOI– 341 mm2 die size– 10 Layers of metal– 790 million transistors on chip– Frequency : 3.5, 4.2, 4.7, 5.0 GHzCustom & semi-custom design style– High frequency constraints3.3 M Lines of VHDL
  9. 9. IBM System p9 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsPOWER6 core pipelineInstruction fetch pipelineInstruction fetch pipelineBR/FX/Load pipelineBR/FX/Load pipelineFloating Point PipelineFloating Point Pipeline Check Point Recovery PipelineCheck Point Recovery PipelineBR/CRBR/CRFXFXLOADLOADLegend :Legend : Pre-decode stageIfetch/Branch stageDelayed/Transmit stageInstruction Decode stageInstruction Dispatch/Issue stageOperand access/execution stageWrite back stageCompletion stageCheck Point stageFX result bypassLoad result bypassFloat result bypassCache access stageP1P1P2P2P3P3P4P4 IC0IC0 ROTROTIC1IC1EX1EX1FMTFMTAGAGDISPDISPPDPDIB0IB0 IB1IB1RFRFRFRFRFRFRFRF DC0DC0 DC1DC1EX2EX2 EX3EX3 EX4EX4 EX5EX5 EX6EX6 EX7EX7EXEXISSISS ECCECCECCECCBHTBHTBHTBHTIFARIFARInstruction dispatch pipelineInstruction dispatch pipeline
  10. 10. IBM System p10 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsPOWER6 corePOWER6 processor is ~2X frequency of POWER5 (4 – 5 GHz)POWER6 instruction pipeline depth equivalent to POWER5– Minimize power– Scale performance with frequencyInstruction Fetch Instruction Buffer/Decode Instruction Dispatch/Issue Data Fetch/ExecuteFXU Dependent executionLoad Dependent executionPOWER6 extends functionality of POWER5 core– 64K I cache, 64K D cache, 2 FXU, 2 Binary FPU, 1 branch execution unit– Two way SMT with 7 instruction dispatch from 2 threads (maximum of 5 instructions per thread)– Decimal Floating Point Unit– VMX Unit (PowerPC’s SIMD ISA)– Recovery Unit~6ns/instr~3ns/instr
  11. 11. IBM System p11 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsBullet-proof computingSystem reliability with recovery unit– Every measure possible taken to preserve application execution– Retry soft errors– Change hardware for hard errorsProcessor architected state check pointedEvery 1 cycleECC & Non-ECC protected circuitry checkedEvery cycleProcessor restarts from last saved checkpointProcessor workload moved to another CPUNo error foundNo error foundError foundError foundSoft error caseHard error case
  12. 12. IBM System p12 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsOverview1. Context : POWER5 vs. POWER6 microarchitecture comparison2. Verification methodology: In the beginning…3. The times they are a changing: SMT arrives in POWER54. POWER6: An in-order design should be simpler, but…5. Future directions?
  13. 13. IBM System p13 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsPOWER4/5/6 RTL verification technologyRTL(VHDL, Verilog)Language CompileModel BuildPhysical VLSIDesign Tools /Custom DesignCycle-basedModelFormalVerification:BooleanEquivalenceCheck(Verity)Software Simulator(MESA)HardwareAccelerator(Awan)Driver/CheckerAssertionsTest ProgramGenerator(GPRO, X-Gen)C++TestbenchConstraintRandomUnitTestbenchPSL et al.(Semi) FormalVerification(SixthSense,RuleBase)
  14. 14. IBM System p14 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsSingle threaded uniprocessor verification for POWER4Unit level: methodology inherited from POWER4– Driven by a combination of instruction level test cases (AVPs) created by Genesys-Pro (GPRO) pseudo-random test generator and random C++ driven irritation– Instruction-By-Instruction (IBI) checking against AVP results– Low level microarchitecture checkers written in C++Processor core (aka “core”) level– Mixture of GPRO pseudo-random and directed random instruction level test cases– IBI checking against AVP results– Low level microarchitecture checkers written in C++- Irritation from random C++ drivers- Highly deterministic and architected state easily verifiable against test
  15. 15. IBM System p15 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsSymmetric multi-processor (SMP) verification for POWER4Chip (dual-core) level– Test generation similar to uniprocessor via GPRO for false-sharingor non-sharing tests• IBI checking against AVP results for two-independent instruction streamscontained within single test• Low level microarchitecture checkers written in C++• L1/L2 interactions primary focus– True-sharing scenarios, lock testing and storage access (“weak”)ordering checked• GPRO employed but….– IBI checking of these accesses is limited or not possible:› Non-unique or non-deterministic results› CML (architecture level coherency monitor) employed to detectthe “right answer” as a post-simulation rule check
  16. 16. IBM System p16 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsOverview1. Context : POWER5 vs. POWER6 microarchitecture comparison2. Verification methodology: In the beginning…3. The times they are a changing: SMT arrives in POWER54. POWER6: An in-order design should be simpler, but…5. Future directions?
  17. 17. IBM System p17 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsPOWER5 SMT verification methodologyEvolutionary based on single thread uniprocessor and SMPapproaches– Traditional SMP scenarios now self-contained in a single core simulation model• Downward migration of dual-core methodology to single core modelNew SMT verification scenario categories– Shared resource and priority conflicts:• SMT resource types:– Equally shared between threads: Queue full conditions easier to hit– Dynamically shared / tagged: Either thread can consume most/all of theresource– Replicated: Not shared…same as single thread– Dynamic thread mode switching: SMT->ST; ST->SMT• Some applications attain better performance in ST mode• Shared resources re-allocated on each mode switch
  18. 18. IBM System p18 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsTraditional SMP approach applied to SMT verificationSMT.tstRandom t0 Random t1Core Level Registers common to both threadst0 RegistersSMP.def(test template)TestGenerationReal memory is common to both threads with test generatormanaging some potential overlapt1 RegistersOutput test caseSMT.tstRandom t0 Random t1Core Level Registers common to both threadst0 RegistersSMP.def(test template)TestGenerationReal memory is common to both threads with test generatormanaging some potential overlapt1 RegistersOutput test case
  19. 19. IBM System p19 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsShared resource and priority conflictsApproach was similar to SMP verification– Testing largely consisted of “symmetric” instruction streamson each thread• A particular resource targeted (e.g., GPR rename registers)– 100 load instructions on each thread– Coverage and lab feedback validated this approach• Good enough: “Got the job done”
  20. 20. IBM System p20 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsPOWER5 dynamic thread mode switchingAll architected states initializedThread enabledInitialStateThread 0 terminatesitselfShared resourcesreallocatedRandom instructionsNormal finishThread enabledRunStateRandom instructionsRestart thread 0Normal finishThread enabledFinalStateAll architected states initializedThread enabledSave architectedstateWake up threadPartition resourcesRestore architectedstateThread killsitselfRandom instructionsThread 0 Thread 1Sim DriverOther threadInterrupt
  21. 21. IBM System p21 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsPOWER5 shared resource re-allocation on mode switch0100200GPR FPRRename Registers perthreadSMT ModeMaxST Mode 0510Split in halfLoad Miss Queue entriesper threadSMT ModeST Mode01020Split in halfBranch Queue (BIQ)entries per threadSMT ModeST Mode02040DynamicallySharedMax LRQ/SRQ entries perthreadSMT modeMaxST mode
  22. 22. IBM System p22 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsOverview1. Context : POWER5 vs. POWER6 microarchitecture comparison2. Verification methodology: In the beginning…3. The times they are a changing: SMT arrives in POWER54. POWER6: An in-order design should be simpler, but…5. Future directions?
  23. 23. IBM System p23 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsPOWER5: centralized complexityPOWER5– Out-of-order design: Even in single thread mode,complex events naturally occur simultaneously– Started from POWER4+: Known workingdesign that was modified incrementally– 23 FO4 design: Isolated complexity inInstruction Sequencing Unit (ISU):• Every unit communicated back to ISU• ISU resolved all exceptions andout-of-order conflicts– ST and SMT modes both supported:• Alternating dispatch cycles per thread• Resources re-allocated on mode switchFXUFPULSUIFUISU
  24. 24. IBM System p24 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsPOWER6 distributed complexityPOWER6– From-scratch mostly in-order design• Normally, design is well behaved• Cross-thread interaction necessary for “toughbugs”– 13 FO4 design: Distributed complexity needed toachieve high performance goals– Recovery unit (RU):• Must resolve out-of-order FP with in-orderpipelines• Checkpoints machine state• Recovers processor from soft errors– Design is inherently in SMT mode all the time(almost)• Dispatch to both threads in same cycle• Most resources dynamically shared / tagged• No resource reallocation on mode switchIFUIDUFPULSURUFXU
  25. 25. IBM System p25 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsThe different verification engines have different strengths related to theverification tasksPOWER6 verification processSoftware simulation– Slow, but low penalty for highly intrusive checking of model internals. Total model visibility.– Hundreds of AIX workstations running 24x7x365– New enhancements helped keep pace with design complexity– 2x number of simulation cycles of POWER5 designHardware-accelerated simulation– 10-1k x Faster than SW sim, but need less intrusive driving/checking to not slow down hardware box.– New usage: Mainline function verification– Yields additional 3x simulation cycle advantage over POWER5 (5x cycle advantage overall)(Semi)-formal verification– (High to) Exhaustive coverage, but higher skill needed to drive. Scaling problems w/ model size.– Extensively used: Proved extremely valuable for complex SMT bugsHardware bring-up– Ideal speed, very limited visibility/controllability
  26. 26. IBM System p26 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsSoftware simulation enhancementsRandom command driven unit simulation for most core units– Yielded >1 Million lines of C++ code– More control over generation for low level events– More efficient test generationIrritator threads at “core model” level– “Symmetric” instruction stream approach employed on POWER5 proved inadequate“S” in SMT is for “Simultaneous”, not “Symmetric”– Target cross-thread interactions at the microarchitecture level– ~2x test generation efficiency– Ensures both threads running the same length (self adjusting)
  27. 27. IBM System p27 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsIrritator thread exampleSMT_Irritator.tstLongRandom t0ShortIrritator t1Core Level Registers common to both threadsSMT_Irritator.def(test template)TestGenerationReal memory with test generator managing some potential overlapIrritator thread restrictions• Cannot cause unexpectedexceptions• Cannot modify memory readby random thread• Cannot modify registersshared with other threads• Architected results may beundefinedt1 Registerst0 RegistersOutput test case
  28. 28. IBM System p28 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsIrritator thread exampleSEQUENCEREPEAT 100SELECTGroup_Allstw nop, ASEQUENCELB0: fdivA: b to LB0Long Random Thread Irritator ThreadGenerated Instr: 101Simulated Instr: 101Generated Instr: 2Simulated Instr: InfiniteKill Irritator Thread
  29. 29. IBM System p29 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsSimulation acceleration usage on POWER6Extensively used on POWER6– Run lab exercisers prior to tape-out•Found additional bugs missed by software simulation•Debug new exerciser functionality prior to lab•Error injection and recovery testing•Reproducibility of lab bugs in “simulation-like” environment forrapid debug of root cause•Rapid testing of bug fixes and collateral damage testing– Linux boot prior to tape-out– Not employed on POWER5 for “mainline” functionalverification
  30. 30. IBM System p30 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsFormal methods are a vital complement tosimulation flow– Lab bring-up bug re-creation• Often faster reproduction than simulation basedapproaches• Aids in root cause analysis• High-coverage / proof of side-effect-free fixes(Semi) Formal methods
  31. 31. IBM System p31 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsError detection and soft error recoveryBiggest challenge on POWER6– Why so hard?• Myriads of injection points coupled with large SMT state space– Often needed multiple “rare” combinations of “asymmetric”events on both threads while specific error was injected• End-to-end recovery testing difficult at unit level– Really a “core” effort– Verification strategy:– Error injection and recovery on hardware accelerated simulationplatform– Dynamic on-the-fly error injection combined with “irritator threads”needed to cover large SMT recovery state space
  32. 32. IBM System p32 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsSummary1. SMT verification has four key pieces– Traditional SMP-like effort– Thread starvation and priority– Starting and stopping threads– Asymmetric “irritator thread” approach to verify often unforeseen cross-thread interactions atthe microarchitecture level2. “From-scratch in-order” SMT design was more difficult to verify than the“out-of-order retrofitted” SMT design– Complex events only occurred due to cross thread interaction– Even though team had experience– Required more “weapons” in the arsenal3. High frequency design drove distributed complexity– Makes verification job harder– Increased dependency on formal verification for difficult bugs4. “Mainframe”-like RAS on POWER6 drove a huge amount of work that wasdifficult to attack at the unit level
  33. 33. IBM System p33 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsOverview1. Context : POWER5 vs. POWER6 microarchitecture comparison2. Verification methodology: In the beginning…3. The times they are a changing: SMT arrives in POWER54. POWER6: An in-order design should be simpler, but…5. Future directions?
  34. 34. IBM System p34 © 2006 IBM Corporation IBM SystemsDRAFT: IBM Confidential © 2008 IBM CorporationIBM Systems & TechnologySMT Verification of the POWER5 and POWER6 High-Performance ProcessorsFuture directionsPredictions– RAS features will be an increasingly important feature of serversystems• POWER6 design has set the “bar” to a new high standard to which futureprocessors will have to measure up- Power Systems Revenue up 29% in 2Q08 (from 2Q07)• Verification methods employed on POWER6 to attack nearly infinite statespace created by the combination of SMT and processor recovery features willbecome standard practice– A migration of “pre-silicon” verification techniques into “post-silicon”hardware lab verification effort• Hardware is the fastest “simulator” available and the state space is gettingbigger with SMT

×