AMD_11th_Intl_SoC_Conf_UCI_Irvine

1] A New Parallel Computing Platform – Heterogeneous System Architecture: Opportunities, Benefits and Feature Roadmap
2] Kaveri Platform Coherency: Shared memory, Platform atomics
3] Kaveri Verification Approach
4] SoC Verification Challenges and Solutions



  1. Platform Coherency and SoC Verification Challenges
     Pankaj Singh, Chethan-Raj M, Prakash Raghavendra, Anindyasundar Nandi, Dibyendu Das and Tony Tye
     The 11th International System-on-Chip (SoC) Conference, Exhibit, and Workshops, October 2013, Irvine, California (www.socconference.com)
     Acknowledgements: Phil Rogers (AMD Corporate Fellow), Roy Ju and Ben Sander (Senior Fellows), Narendra Kamat, Praveen Dongara and Lee Howes
  2. Today's Topics
     1) A New Parallel Computing Platform – Heterogeneous System Architecture: Opportunities, Benefits and Feature Roadmap
     2) Kaveri Platform Coherency: Shared memory, Platform atomics
     3) Kaveri Verification Approach
     4) SoC Verification Challenges and Solutions
  3. A New Parallel Computing Platform – Heterogeneous System Architecture (HSA)
  4. APU: Accelerated Processing Unit
     The APU is a great advance compared to previous platforms: it combines scalar processing on the CPU with parallel (SIMD) processing on the GPU, together with high-bandwidth access to memory.
     Challenge: how do we make it even better going forward?
     - Easier to program
     - Easier to optimize
     - Easier to load balance
     - Higher performance
     - Lower power
  5. The HSA Opportunity on Modern Applications
     Problem: historically, developers program CPUs (~20M+ CPU coders, ~4M apps, per IDC), which delivers good user experiences. GPUs and other hardware blocks are hard to program, and not all workloads accelerate, so today there are only tens of thousands of GPU coders and a few hundred apps, delivering significant but niche value.
     Solution: HSA + libraries = productivity and performance with low power. The goal is a few million HSA coders and a few hundred thousand HSA apps, enabling a wide range of differentiated experiences (differentiation in performance, reduced power, features, time to market) with higher developer return for lower developer investment (effort, time, new skills).
  6. HSA and Its Benefits
     HSA is a computing platform that drives a new class of applications: app-accelerated software applications alongside graphics, data-parallel, and serial/task-parallel workloads.
     - HSA is an enabler of the APU's higher performance and power efficiency
     - Our industry-leading APUs speed up applications beyond graphics
     - CPU and GPU (APUs) work cooperatively together directly in system memory
     - Makes programming the APU as easy as C++
     - Improves performance per watt
     Ref [1]
  7. HSA Efficiency Improvement (an Example)
     Improves power and performance: move the application from the CPU to the GPU, remove data copies, and reduce launch time.
     Energy computation breakdown: MotionDSP 720p video clean-up. [Chart: measured power (W) and measured performance (fps) for CPU-only vs. CPU+GPU execution, with power broken down into CPU cores, NB+GPU and DRAM.]
     The measured gains combine with a simulated 1.32x from removing memory copies: 1.11 x 2.88 x 1.32 = 4.22x better energy efficiency, and the application becomes easier to program.
     Ref [1]
  8. Heterogeneous System Architecture Feature Roadmap
     Physical Integration: Integrate CPU and GPU in silicon; Unified Memory Controller; Common Manufacturing Technology
     Optimized Platforms: GPU Compute C++ support; User Mode Scheduling; Bi-directional Power Management between CPU and GPU
     Architectural Integration: Unified Address Space for CPU and GPU; GPU uses pageable system memory via CPU pointers; Fully coherent memory between CPU and GPU
     System Integration: GPU compute context switch; GPU graphics pre-emption; Quality of Service
  9. Platform Coherency
  10. Kaveri SoC – Enabling Shared Memory and Platform Atomics
      Shared memory accesses between the CPU and GPU happen via system memory.
      - This corresponds to the notion of shared virtual memory (SVM) in OpenCL 2.0, available via the clSVMAlloc() call. With SVM, CPUs and GPUs share an address space and can pass pointers to the same memory locations (a host-side sketch follows below).
      - The compiler supports SVM and atomics calls that work across the CPU-GPU boundary.
      - System-memory accesses may take one of three paths:
        - If coherence with the CPU is not required: GARLIC path
        - If kernel-granularity coherence with the CPU is required: ONION bus path
        - If instruction-granularity coherence with the CPU is required: bypass L2 via the ONION+ bus (required by atomics)
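     The SVM path above can be illustrated with a minimal host-side sketch, not taken from the slides: it assumes an already-created OpenCL 2.0 context, command queue and kernel, uses the standard clSVMAlloc / clSetKernelArgSVMPointer / clSVMFree entry points, and omits error handling. Sizes and the work-group configuration are illustrative.

         #define CL_TARGET_OPENCL_VERSION 200
         #include <CL/cl.h>

         // Sketch: allocate a fine-grained SVM buffer that both CPU and GPU can read,
         // write and update atomically through the same pointer.
         void run_with_svm(cl_context ctx, cl_command_queue queue, cl_kernel kernel) {
             int* list = static_cast<int*>(clSVMAlloc(
                 ctx,
                 CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS,
                 100 * sizeof(int), /*alignment=*/0));

             list[0] = 3;                                  // CPU writes through the shared pointer...
             clSetKernelArgSVMPointer(kernel, 0, list);    // ...and hands the same pointer to the GPU kernel.

             size_t global = 64;
             clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
             clFinish(queue);                              // CPU can now read the GPU's updates directly.

             clSVMFree(ctx, list);                         // SVM memory is released through the context.
         }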
  11. Concurrent Stack Push Using Atomic Compare-and-Exchange (an Example)
      Each CPU thread and each GPU work-item executes the following code concurrently:
      - The code shows an example implementation of a concurrent stack's "push" operation.
      - compare_exchange_strong is an atomic call that ensures only one CPU thread / GPU work-item at a time succeeds in updating the "head" pointer of the stack, which is stored in list[0].

          do {
              head = list[0];   // redundant, because the atomic call updates head on failure
              list[i] = head;
          } while (!atomic_compare_exchange_strong(&list[0], &head, i));

      Worked example: work-items i=2 and i=4 contend for the atomic compare-exchange (ACE) on the list 3 (head) -> 5 -> -1; after i=2 wins, the list is 2 (head) -> 3 -> 5 -> -1. A CPU-side equivalent in C++11 atomics is sketched below.

          Time instant         | Work-item i=2       | Work-item i=4
          Before ACE           | head=3, list[2]=3   | head=3, list[4]=3
          ACE                  | Wins!               | Loses
          After ACE completes  | list[0]=2           | Goes back and retries; list[0]=2
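     For readers more familiar with host-side code, the same push loop can be written with C++11 std::atomic. This is a sketch rather than the slides' code: it keeps the index-linked layout from the example (list[0] holds the head index, list[i] holds node i's "next" index) and shows the CPU-thread form of the loop that the GPU work-items run against the shared SVM array.

         #include <atomic>

         // Push node i onto the index-linked stack stored in 'list'.
         void push(std::atomic<int>* list, int i) {
             int head = list[0].load();
             do {
                 list[i] = head;                                   // link the new node to the current head
             } while (!list[0].compare_exchange_strong(head, i));  // on failure, head is reloaded automatically
         }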
  12. Implementing Platform Atomics for Kaveri
      - The compiler implements these atomics (per the OpenCL 2.0 standard) for Kaveri.
      - The key issue in implementing these atomics is to make sure that both the CPU and the GPU see the shared memory in a "coherent" state.
      - Coherency is implemented using the ONION+ memory path and GPU ISA instructions that can selectively invalidate/bypass the L1/L2 caches on the GPU side and snoop to invalidate the CPU caches. This support is provided in the KV SoC.
      - For example, an atomic_load with acquire semantics generates the GPU-side code shown below (on Kaveri, L2 is always bypassed for coherent access); an atomic_store with release semantics generates the second sequence.

          atomic_load (acquire):
          1. load with glc=1     // bypass the L1 cache
          2. s_waitcnt 0         // wait for the load to complete
          3. buffer_wbinv_vol    // invalidate L1 so that any following load reads from memory

          atomic_store (release):
          1. s_waitcnt 0         // wait for any previous memop to complete
          2. store with glc=0    // L1 is a write-through cache, so the write goes to memory as L2 is bypassed
          3. s_waitcnt 0         // prevent any following memop from moving up

      - OpenCL 2.0 and C11 atomics support various kinds of memory_scope and memory_ordering. A CPU-side view of the same acquire/release pairing is sketched after this slide.
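     On the CPU side, the same acquire/release pairing maps onto C++11/C11 atomics. The sketch below is not from the slides; it only illustrates the ordering contract that the GPU sequences above implement: a release store publishes data, and an acquire load on the other agent is guaranteed to observe it.

         #include <atomic>

         int payload = 0;
         std::atomic<int> flag{0};

         // Producer (e.g. a CPU thread writing into SVM for the GPU to consume):
         void producer() {
             payload = 42;                                // ordinary write to shared data
             flag.store(1, std::memory_order_release);    // release: earlier writes become visible first
         }

         // Consumer (conceptually what the GPU's load glc=1 / s_waitcnt / invalidate sequence provides):
         int consumer() {
             while (flag.load(std::memory_order_acquire) == 0)  // acquire: later reads see the producer's writes
                 ;                                              // spin until published
             return payload;                                    // guaranteed to read 42
         }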
  13. Kaveri SoC Verification Approach
  14. Traditional Verification and the SoC Challenge
      [Block diagram: CPU, NorthBridge, DRAM model, graphics model (GFX), SouthBridge, BFM]
      CPU-based verification:
      - Assembly-based input
      - A memory image of x86 machine code is preloaded into the DRAM model
      - The CPU fetches instructions from DRAM and executes them
      GPU-based verification:
      - Higher-level language (C/C++)
      - A BFM is used across a PCIe-based interface to inject data
      - The GPU sends requests to DRAM over two paths: coherent and non-coherent
      SoC verification challenge:
      - The HSA coherency environment adds a layer of complexity.
      - The SoC GPU needs to be programmed, which requires a host. The SoC CPU can be used as the host, but running the full host software stack results in huge simulation time.
      - One approach is a mailbox, which is inefficient due to the lack of CPU-GPU interaction and longer run times.
      - GPU-focused verification alone is not suitable for CPU-GPU interaction (HSA).
  15. SoC Verification Methodology: Test Flow
      Running driver code on a simulated CPU is impractical due to simulation run-times. Intent Capture is a mechanism that allows existing discrete-GPU graphics tests to execute on the CPU in a heterogeneous APU simulation.
      - The memory accesses and configuration writes from the GPU test (OpenCL test, sp3 shader) are extracted into C function calls.
      - Intent Capture performs this extraction and encapsulates the GPU test into a function called Replay().
      - On the CPU side, one thread runs the Replay function while the other threads execute the CPU side of the test.
      - The composite test (CPU test + generated FusionReplay function) is compiled using cxshell to generate a .sim memory image, which then runs in the APU RTL simulation and its output is checked. A purely illustrative sketch of a captured replay function follows.
      Ref [4]
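     To make the intent-capture idea concrete, the sketch below shows the rough shape a generated replay function might take. It is entirely illustrative: the helper names (reg_write, mem_write), the addresses and the register meanings are invented for this example and are not AMD's actual generated code or tooling.

         #include <cstdint>
         #include <cstddef>
         #include <cstdio>

         // Hypothetical stand-ins for the simulation environment's register/memory access
         // routines that the capture tool would target; here they simply log the accesses.
         static void reg_write(uint64_t addr, uint32_t value) {
             std::printf("CFG write [%#llx] = %#x\n", (unsigned long long)addr, value);
         }
         static void mem_write(uint64_t addr, const void* data, size_t len) {
             (void)data;
             std::printf("MEM write [%#llx], %zu bytes\n", (unsigned long long)addr, len);
         }

         // Illustrative shape of a generated replay function: the GPU test's configuration
         // writes and memory accesses flattened into plain C/C++ calls that one CPU thread
         // executes while the other threads run the CPU side of the test.
         void FusionReplay() {
             static const uint32_t shader[] = { 0xDEADBEEF };  // placeholder for the captured sp3 shader image
             mem_write(0x100000, shader, sizeof(shader));      // preload the shader image into memory
             reg_write(0xA000, 0x100000);                      // hypothetical: point the GPU at the shader
             reg_write(0xA004, 0x1);                           // hypothetical: kick off the dispatch
         }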
  16. Power Management: BAPM (Illustration with a CPU-Centric Scenario)
      [Figure: multiple boost P-states (Pb0 ... Pbx above P0/Pbase, then SWP0, SWP1, ...) contrasting the hardware view with the SW/OS view of core power, rest-of-APU power and die temperature, for apps with high, medium and low CAC with all or half the cores active]  Ref [2]
      - A power monitor per compute unit calculates CPU power, and the GPU power monitor calculates GPU power; both report to the NB CAC manager and SMU firmware.
      - Firmware converts power into temperature estimates, compares temperature to the limit, and adjusts voltage/frequency: if Temp > Limit, reduce the power allocation; if Temp < Limit, increase it.
      - In a multi-core design, apps running on CPU/GPU cores may consume less power than budgeted; power-efficient algorithms exploit this power headroom for performance.
      - The GPU can borrow power credit from the CPU in GPU-centric scenarios and vice versa. A schematic sketch of this credit-balancing loop follows.
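     The control loop described above can be summarized in a short schematic sketch. It is an illustration of the stated policy (measure power, estimate temperature, compare with a limit, adjust the allocation), not AMD's SMU firmware; the entities, coefficients and step size are all invented for the sketch.

         #include <vector>

         struct Entity {                      // a CPU core or the GPU
             double measured_power_w;
             double temp_estimate_c;
             double power_budget_w;
         };

         // Illustrative model of the firmware's estimate: temperature tracks consumed power.
         static double estimate_temp(const Entity& e) {
             return 45.0 + 1.5 * e.measured_power_w;       // invented coefficients
         }

         // One iteration of a BAPM-style loop: entities running hot are throttled, entities
         // with thermal headroom are allowed to boost with the freed-up power credit.
         void bapm_step(std::vector<Entity>& entities, double temp_limit_c, double step_w) {
             for (auto& e : entities) {
                 e.temp_estimate_c = estimate_temp(e);
                 if (e.temp_estimate_c > temp_limit_c)
                     e.power_budget_w -= step_w;           // too hot: reduce power allocation
                 else
                     e.power_budget_w += step_w;           // headroom: increase power allocation
             }
             // A real implementation would also keep the total allocation within the package
             // TDP and translate each budget into a voltage/frequency (P-state) request.
         }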
  17. BAPM Verification Approach @ SoC
      [Block diagram: per-core CPU/GPU power monitors feeding the NB CAC manager and SMU firmware]
      - Developed high- and low-power-consuming CPU patterns based on micro-architecture and power analysis, and interleaved them in random stimulus.
      - Used an irritator to manipulate the credits sent to the CAC manager at chosen times, hitting corner cases like back-to-back boost/throttle.
      - Modeled the firmware algorithm using a simple BFM and added a CSR framework to drive reads/writes to the CAC manager.
      - Ran a very few sanity tests with real firmware loaded through the backdoor to check the end-to-end flow.
      - Used irritators to model GPU power-credit reporting (CPU-centric) instead of running GPU applications; the GPU power monitor is verified at GPU IP level.
      - Efficient coverage-driven random verification: CPU boosted because the GPU gave away credits and vice versa; crosses of CPU/GPU events and their effect on BAPM.
  18. SoC Verification Challenges and Solutions
  19. Test Stimulus Reuse and Porting to SoC
      Tool and flow differences between the IP-level and SoC-level setups make stimulus reuse difficult.
      - Use a functional (C) model to simulate the IP RTL in an SoC scenario for IP test development and easy porting to SoC: a simple HSA SoC test with one read-write takes about 18 hours in RTL, whereas it runs in under 1 hour on the heterogeneous C model (CPU C model/RTL, GPU C model, bus unit, memory models such as cMemory, MPMM and MEMIO).
      - Intent Capture and Playback methodology: the DV test is captured against the GPU C model, then the captured output is replayed against the APU and the outputs compared.
      - Test setup updates at the IP level (configuration changes, test stimulus defines, exported suite/test keys, memory config, NB/DCT programming and performance options) allow IP tests to be reused; an ip2soc script merges IP and SoC options into a job spec, and run_job command-line options drive regression execution across GNB, XNB and UNB targets.
      Goal: improve quality, reduce development time.
  20. HW-SW Interaction: Modeling and Abstraction
      Complex and evolving logic is moving from hardware to firmware for better controllability.
      Challenges:
      - Firmware algorithms are compute-intensive and are often developed late in the design cycle.
      - Loading and executing the software adds further verification time.
      Connected-standby verification approach:
      - Model the relevant section of the software using a BFM with a proper interface to the hardware.
      - Add sufficient controllability to stress different paths of the BFM model and find coverage.
      - Use adaptive stimulus based on coverage of the BFM/state machine.
      Goals: improve quality, reduce development time.
  21. Adaptive Stimulus
      Typically, power-management transitions kick off after active code execution stops, which creates deep corner cases associated with thread-level coordination in a multi-core design.
      - Predicting occurrences of these deeper phases and targeting them by code/stimulus is difficult.
      - Define the power-management modes as state machines, each state having granular phases including thread-specific information.
      - A dynamic irritator monitors these state transitions, inserts random/directed asynchronous events (different sorts of interrupts, probes, warm reset) and updates a scoreboard.
      - Events are generated very close to the relevant points, which provides great controllability.
      - The dynamic irritator adapts based on scoreboard statistics, eventually putting more weight on the less frequently covered <state> x <event> buckets; a small sketch of this weighting follows.
      Goals: improve quality, reduce development time.
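     The adaptive weighting can be illustrated with a short sketch. It is only a schematic of the idea stated above (bias random event selection toward under-covered <state> x <event> buckets); the data structures and the inverse-count weighting are assumptions, not the actual irritator implementation.

         #include <map>
         #include <random>
         #include <string>
         #include <utility>
         #include <vector>

         using Bucket = std::pair<std::string, std::string>;   // <state> x <event>

         // Pick the next event for the current state, weighting each bucket by the inverse of
         // how often the scoreboard has already seen it, so rare crossings get more stimulus.
         std::string pick_event(const std::string& state,
                                const std::vector<std::string>& events,
                                const std::map<Bucket, unsigned>& scoreboard,
                                std::mt19937& rng) {
             std::vector<double> weights;
             for (const auto& ev : events) {
                 auto it = scoreboard.find({state, ev});
                 unsigned hits = (it == scoreboard.end()) ? 0 : it->second;
                 weights.push_back(1.0 / (1.0 + hits));        // fewer hits => higher weight
             }
             std::discrete_distribution<size_t> dist(weights.begin(), weights.end());
             return events[dist(rng)];
         }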
  22. Constrained-Random Stimulus and Randomization at the SoC Level
      A complex SoC requires randomization at different levels: random initial states, register and fuse values, and modes (LFBR, BfD, long_init / unfused test).  Ref [3]
      - A randomization utility builds a RandomConfig executable from SoC constraints, IP constraints, package-level information and command-line options.
      - At time t=0 the configuration values are imported after reset, so each run starts from a randomized but legal state.
      Goals: improve quality, reduce development time.
  23. Overcoming Limitations of Gate-Level Simulation
      Challenges with netlist simulation: longer run-times and longer debug times.
      - Approach to minimize runtime: replace compute-intensive RTL and the associated verification components with a less intensive test-vector applicator that applies test vectors directly from an FSDB file. Flow: run the RTL simulation and capture the FSDB; create the gate-sim files (gatesim.v, forces.v); build the netlist with the gate-sim files and a testbench that drives stimulus from the FSDB; run the netlist simulations (with FSDB dump). This gives roughly 10x runtime improvement over the traditional approach.
      - Approach to minimize debug effort: a Verdi NPI-based methodology to automate debug. Ref [5]
      Goals: improve quality, reduce development time.
  24. Thank You
  25. References
      [1] A New Parallel Computing Platform – HSA, CTHPC 2013 keynote speech, Roy Ju, AMD Senior Fellow.
      [2] AMD APUs: Dynamic Power Management Techniques, DAC 2013, Praveen Dongara, System Architect.
      [3] Wilson Research Group / MGC, 2013.
      [4] Kaveri DTP, internal document.
      [5] Innovative Approach to Overcome Limitations of Netlist Simulation, SNUG 2013, Prodip K., Pankaj S., Meera M., Narendran K.
  26. Glossary
      - GPU: Graphics Processing Unit
      - APU: Accelerated Processing Unit
      - OpenCL(TM): Open Computing Language
      - TDP: Thermal Design Power, the average thermal dissipation power the design infrastructure must be able to cool
      - AMD Turbo Core Technology: AMD boost mechanism
      - BAPM: Bi-directional Application Power Management
      - CAC: capacitance AC switching, a measure of a cluster's switching activity
      - P-state: processor performance state
      - GARLIC: Graphics Accelerated Reduced Latency Integrated Channel
      - ONION: on-chip Northbridge-to-I/O non-coherent bus
      - FSDB: Fast Signal Database
  27. Backup
  28. Dynamic Fine-Grained Power Transfers
      The dynamically calculated temperature of each core and the GPU enables the operating point of each to be dynamically balanced in order to maximize performance within temperature limits. Low activity in one core enables it to act as a thermal sink for a more active core.
      [Figure: per-core temperature profiles (roughly 75 to 100 degrees C) for GPU-centric, balanced and CPU-centric workloads]  Ref [2]
  29. Disclaimer and Trademark Attribution
      Disclaimer: The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD makes no representations or warranties with respect to the contents hereof and assumes no responsibility for any inaccuracies, errors or omissions that appear in this information. AMD specifically disclaims any implied warranties of merchantability or fitness for any particular purpose. In no event will AMD be liable to any person for any direct, indirect, special or other consequential damages arising from the use of any information contained herein, even if AMD is expressly advised of the possibility of such damages.
      Trademark Attribution: AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL and the OpenCL logo are trademarks of Apple, Inc. and used by permission of Khronos. Microsoft, Windows and DirectX are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. (c)2011 Advanced Micro Devices, Inc. All rights reserved.
