GPU Latency Analysis


GPU latency and outstanding request technical note

Published in: Technology

  1. Technical Note for Vivante GC Cores: Latency and Outstanding Request Analysis. Version 1.0, January 2012
  2. Revision History
     Version | Date             | Author     | Description
     1.0     | January 16, 2012 | Benson Tao | Initial release
  3. 1 General System Performance Analysis

Q: Can you please provide additional details on load balancing and recommendations on optimizing CPU and GPU interactions for peak system performance?

In a graphics subsystem, an application such as a 3D game or a rich GUI calls graphics APIs to access the graphics hardware through the operating system. When the application requests an image to be rendered onscreen, the API calls the OS, which in turn invokes the GPU driver to communicate with the GPU hardware and draw the image. From the CPU's perspective, the CPU accumulates and sets up graphics commands that are dispatched to the GPU for processing and display rendering. As graphics performance and screen/object detail increase with advances in technology, an optimized method for CPU-GPU communication must be maintained. It has to ensure external and internal bandwidth availability, internal communication with minimal latency, cache coherence, optimizations at the system, processor, and GPU level, and sensible access priorities for the different system blocks and the GPU/CPU. For example, starving the display controller of access to the GPU screen data will cause display flickering and a negative user experience.

In general, since each SoC design has specific requirements, there needs to be a balance between all resources: communications (AXI/AHB, OCP, NoC, proprietary, etc.), memory interfaces, OS- and system-specific optimizations, chip floor planning, graphics compression technologies, and effective use of memories in the hierarchy (registers, caches, system memories, efficient banking, coherency, single/dual-port RAMs, access speeds, etc.). Since each design is customized, we help provide test vectors and performance traces to test the graphics subsystem, which can be included in a customer's full-chip test simulation to test overall system loading.
We also work with the customer directly, based on application type, usage, and requirements, in addition to analyzing CPU and system resources, to recommend the best graphics core (and any optimizations or derivative cores) based on power, size, performance, and features.

2 Latency Analysis

SoC architectures consist of multiple processing engines (CPU, 3D, video, etc.), memory units (DDR, caches, etc.), and I/O blocks (network, USB, HDD, Flash, wireless, RF, etc.). The functional blocks integrated into the SoC define the product, target market, performance, features, and key differentiations that are the selling points of the chip and the end device. Each processing engine, such as the CPU, video, and 3D graphics, has a different (data) traffic profile, which affects bandwidth, latency, and performance requirements. A well-defined SoC interconnect scheme capable of handling multiple processing units in parallel with minimal or no performance degradation is ideal in today's multi-tasking environment (e.g. video + composition, or video + 3D graphics + composition). In addition to an efficient interconnect design (AXI, ACE-Lite, NoC, etc.), a well-architected memory controller subsystem is required to match the bandwidth, latency, and data transfer requirements of each processing unit. A designer does not want a high-speed interconnect coupled with a low-speed memory controller (MC) and memory bus (speed, width), since the MC will then be the bottleneck.

As system complexity increases, overall performance is determined by the engine speed, the interconnect design, and the ability of the memory system to provide sustained data bandwidth to the engine while also meeting the real-time performance goals of latency-sensitive traffic. To design an optimized system, all parts of the data transfer from initiator to destination must be analyzed in totality.
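The bottleneck argument above (a fast interconnect paired with a slow memory controller) can be illustrated with a minimal sketch. It assumes a simple model in which each stage moves a fixed number of bytes per clock cycle and the path is limited by its slowest stage; the function names and all clock and width values are hypothetical and are not taken from the note.

```python
def peak_bandwidth_bytes_per_s(clock_hz: float, width_bytes: int) -> float:
    """Peak bandwidth of a link that moves `width_bytes` on every clock cycle."""
    return clock_hz * width_bytes


def sustained_path_bandwidth(*stage_bandwidths: float) -> float:
    """A serial data path can go no faster than its slowest stage."""
    return min(stage_bandwidths)


# Hypothetical example values (assumptions, not figures from the note).
interconnect_bw = peak_bandwidth_bytes_per_s(400e6, 16)  # 400 MHz, 128-bit fabric
memory_ctrl_bw = peak_bandwidth_bytes_per_s(200e6, 8)    # 200 MHz, 64-bit memory bus

path_bw = sustained_path_bandwidth(interconnect_bw, memory_ctrl_bw)
bottleneck = "memory controller" if path_bw == memory_ctrl_bw else "interconnect"
print(f"path bandwidth: {path_bw / 1e9:.1f} GB/s, limited by the {bottleneck}")
```

With these assumed values the fabric could carry four times what the memory controller delivers, so the extra interconnect speed buys nothing for sustained traffic.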
In the following paragraphs we focus on latency considerations along with the number of outstanding requests.

Latency in a GPU design takes into account interconnect (bus) and data latencies. Bus latency is the total round-trip latency from the GPU (initiator) through the interconnect to external DDR memory and back to the GPU, as depicted below:

[Figure: round-trip request path from the GPU through the interconnect fabric and memory controller to DDR memory and back, numbered one to six]
  4. The total bus latency is the sum of the latencies incurred as a request or its data travels from the GPU to memory and back. In the image above, the total latency is the sum of the latencies of steps one to six, which includes the latency through the interconnect fabric, memory controller, and DDR memories. The latency of each component is system dependent, with a minimal latency (in GPU clock cycles) required for optimal performance.

The second part is data latency, which depends on the priorities assigned to each functional unit. This data delay includes the number of GPU cycles required to receive data through arbitration. For example, the display controller can be assigned the highest priority since it needs to refresh the screen. If the display controller is not assigned a high priority and all devices request access to the interconnect or to data in external DDR memory, the display controller may be starved of data and cause a display refresh glitch or an incorrect display update. Such a glitch has a negative impact on the user since it is easily visible. Engine priorities need to be balanced and analyzed to ensure acceptable latency and sufficient bandwidth during peak data access.

To overcome bus and data latency, the number of outstanding requests needs to be defined correctly so that the GPU is not starved while waiting for data to come back from system memory. The number can be determined from the total latency, with the aim of holding the GPU latency under 200 cycles.
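The latency accounting described above can be sketched as a simple budget check: add up the per-step bus latencies, add the data (arbitration) latency, and compare the total with the 200-cycle target. This is only an illustration. The six step names and every cycle count below are hypothetical placeholders, since the note states that the actual values are system dependent.

```python
from typing import Mapping

LATENCY_BUDGET_CYCLES = 200  # target mentioned in the note, in GPU cycles


def total_bus_latency(step_latencies: Mapping[str, int]) -> int:
    """Bus latency is the sum of the latencies of every step of the round trip."""
    return sum(step_latencies.values())


def within_budget(l_bus: int, l_data: int,
                  budget: int = LATENCY_BUDGET_CYCLES) -> bool:
    """True when bus latency plus data (arbitration) latency stays under the budget."""
    return l_bus + l_data < budget


# Hypothetical round trip in six steps (names and values are assumptions).
steps = {
    "1_gpu_to_interconnect": 10,
    "2_interconnect_to_memory_controller": 15,
    "3_memory_controller_to_ddr": 20,
    "4_ddr_to_memory_controller": 60,
    "5_memory_controller_to_interconnect": 15,
    "6_interconnect_to_gpu": 10,
}

l_bus = total_bus_latency(steps)
l_data = 40  # assumed cycles lost to arbitration behind higher-priority units

print(f"L_bus={l_bus}, L_data={l_data}, total={l_bus + l_data}, "
      f"within budget: {within_budget(l_bus, l_data)}")
```

Raising the assumed arbitration delay (for example because the display controller holds the highest priority) shows how quickly the data latency can push the total over the target.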
The general formula for calculating the number of outstanding requests is:

$$OR = \frac{L}{N}$$

where
OR = number of outstanding requests
L = latency in bus cycles = L_bus (bus latency) + L_data (data latency)
N = burst length in bus cycles, dependent on bus width and bus size (bytes)

The number of outstanding requests in the GPU also affects the amount of storage available in the GPU for the FIFO; each entry in the FIFO can be considered one outstanding transaction.

In the Vivante GC Core design, latency is also hidden through other mechanisms, including multi-threading, parallel execution, prefetching, efficient use of cache, and memory optimizations such as burst building, request merging, compression, smart banking, and other innovations. All of these need to be considered when designing the GPU subsystem.
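As a worked illustration of the outstanding-request calculation, the sketch below assumes the relation reads as total latency divided by burst length (OR = L / N, which is how the symbol definitions suggest it), rounds the result up to a whole request, and derives the burst length in bus cycles as burst bytes divided by bus width. The rounding, the burst-length derivation, the function names, and all numeric values are assumptions made for this example.

```python
import math


def burst_length_cycles(burst_bytes: int, bus_width_bytes: int) -> int:
    """Bus cycles needed to move one burst (assumed: burst size / bus width)."""
    if burst_bytes <= 0 or bus_width_bytes <= 0:
        raise ValueError("burst size and bus width must be positive")
    return math.ceil(burst_bytes / bus_width_bytes)


def outstanding_requests(l_bus: int, l_data: int, n_cycles: int) -> int:
    """OR = L / N with L = L_bus + L_data, rounded up to a whole request."""
    if n_cycles <= 0:
        raise ValueError("burst length must be positive")
    return math.ceil((l_bus + l_data) / n_cycles)


# Hypothetical system: 64-byte bursts on a 128-bit (16-byte) bus.
n = burst_length_cycles(burst_bytes=64, bus_width_bytes=16)
required = outstanding_requests(l_bus=130, l_data=40, n_cycles=n)

# Each FIFO entry holds one outstanding transaction, so this is also the FIFO depth.
print(f"N={n} cycles per burst, OR={required} outstanding requests (FIFO entries)")
```

Under these assumptions, halving the burst length doubles the number of requests that must be in flight to cover the same latency, which is why the FIFO depth has to be sized together with the bus parameters.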