GPU Latency Analysis
GPU latency and outstanding request technical note

Published in Technology
Transcript

  • 1. Technical Note for Vivante GC Cores: Latency and Outstanding Request Analysis. Version 1.0, January 2012
  • 2. Revision History

    Version | Date             | Author     | Description
    1.0     | January 16, 2012 | Benson Tao | Initial release
  • 3. 1 General System Performance Analysis

    Q: Can you please provide additional details on load balancing and recommendations on optimizing CPU and GPU interactions for peak system performance?

    In a graphics subsystem, an application such as a 3D game or fancy GUI calls different APIs to access the graphics hardware through programming calls to the operating system. When the application requests an image to be rendered onscreen, the API calls the OS, which in turn invokes the GPU driver to communicate with the GPU hardware and draw the image. From a CPU perspective, the CPU accumulates and sets up graphics commands that are dispatched to the GPU for processing and display rendering. As graphics performance and screen/object detail increase with advances in technology, an optimized method for CPU-GPU communication must be maintained to ensure external/internal bandwidth availability, internal communications with minimal latencies, cache coherence, optimizations (system, processor, GPU), and access priorities for different system blocks and the GPU/CPU (e.g., starving the display controller of access to the GPU screen data will cause display flickering and a negative user experience).

    In general, since each SoC design has specific requirements, there needs to be a balance between all resources, communications (AXI/AHB, OCP, NoC, proprietary, etc.), memory interfaces, OS/system-specific optimizations, chip floor planning, graphics compression technologies, and effective use of memories in the hierarchy (registers, caches, system memories, efficient banking, coherency, single/dual-port RAMs, access speeds, etc.). Since each design is customized, we help provide test vectors and performance traces to test the graphics subsystem, which can be included in a customer's full-chip test simulation to test overall system loading.
    We also work with the customer directly based on application type, usage, and requirements, in addition to analyzing CPU and system resources, to recommend the best graphics core (and any optimizations or derivative cores) based on power, size, performance, and features.

    2 Latency Analysis

    SoC architectures consist of multiple processing engines (CPU, 3D, video, etc.), memory units (DDR, caches, etc.), and I/O blocks (network, USB, HDD, Flash, wireless, RF, etc.). All these functional blocks integrated into the SoC define the product, target market, performance, features, and key differentiators that are the selling points of the chip and end device. Each processing engine, such as the CPU, video, and 3D graphics, has a different data traffic profile, which affects bandwidth, latency, and performance requirements. A well-defined SoC interconnect scheme capable of handling multiple processing units in parallel with minimal or no performance degradation is ideal in today's multi-tasking environment (e.g., video + composition, or video + 3D graphics + composition). In addition to an efficient interconnect design (AXI, ACE-Lite, NoC, etc.), a well-architected memory controller subsystem is required to match the bandwidth, latency, and data transfer requirements of each processing unit. A designer does not want a high-speed interconnect coupled with a low-speed memory controller (MC) and memory bus (speed, width), since the MC will then be the bottleneck.

    As system complexity increases, overall performance is determined by the engine speed, the interconnect design, and the ability of the memory system to provide sustained data bandwidth to the engine while also meeting the real-time performance goals of latency-sensitive traffic. To design an optimized system, all parts of the data transfer from initiator to destination must be analyzed in totality.
    In the following paragraphs we focus on latency considerations along with the number of outstanding requests.

    Latency in a GPU design takes into account interconnect (bus) and data latencies. Bus latency is the total round-trip latency from the GPU (initiator) through the interconnect to external DDR memory and back to the GPU, as depicted below:

    [Figure: round-trip request path from the GPU through the interconnect fabric and memory controller to DDR memory and back, with latency segments numbered one to six]
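The round-trip view of bus latency described above can be sketched as a sum of per-hop latencies. This is a minimal illustration only: the six hop names and their cycle counts are hypothetical assumptions chosen for the example, not values from the note or from any Vivante design.

```python
# Hypothetical per-hop latencies in GPU clock cycles.
# The hop names and values are illustrative assumptions, not data from the note.
ROUND_TRIP_HOPS = [
    ("GPU to interconnect", 5),
    ("interconnect to memory controller", 15),
    ("memory controller to DDR", 30),
    ("DDR to memory controller", 30),
    ("memory controller to interconnect", 15),
    ("interconnect to GPU", 5),
]


def total_bus_latency(hops):
    """Return the round-trip bus latency as the sum of all hop latencies."""
    return sum(cycles for _name, cycles in hops)


if __name__ == "__main__":
    for name, cycles in ROUND_TRIP_HOPS:
        print(f"{name:40s} {cycles:4d} cycles")
    print(f"{'total bus latency':40s} {total_bus_latency(ROUND_TRIP_HOPS):4d} cycles")
```

Listing the hops separately makes it easy to see which component dominates the round trip when real measurements are substituted for the placeholder values.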
  • 4. The total bus latency is the sum of all latencies incurred as a request or data travels from the GPU to memory and back. In the image above, the total latency is the sum of latencies one through six, which includes the latency through the interconnect fabric, memory controller, and DDR memories. The latency of each component is system dependent, and each should be kept as low as possible (measured in GPU clock cycles) for optimal performance.

    The second part is data latency, which depends on the priorities assigned to each functional unit. This delay includes the number of GPU cycles spent in arbitration before data is received. For example, the display controller can be assigned the highest priority since it needs to refresh the screen. If the display controller is not assigned a high priority and all devices request access to the interconnect or to data in external DDR memory, the display controller may be starved of data and cause a display refresh glitch or incorrect display update. Such a glitch has a negative user impact since it is easily visible. Engine priorities need to be balanced and analyzed to ensure acceptable latency and sufficient bandwidth during peak data access.

    To overcome bus and data latency, the number of outstanding requests needs to be correctly defined so that the GPU is not starved while waiting for data to come back from system memory. This number can be determined from the total latency, with the goal of keeping the latency seen by the GPU under 200 cycles.
    The general formula for calculating the number of outstanding requests is:

    OR = L / N

    OR = Number of Outstanding Requests
    L = Latency in Bus Cycles = Lbus (Bus Latency) + Ldata (Data Latency)
    N = Burst Length in Bus Cycles, Dependent on Bus Width and Bus Size (Bytes)

    The number of outstanding requests also affects the amount of FIFO storage required in the GPU, since each entry in the FIFO can be considered one outstanding transaction.

    In the Vivante GC Core design, latency is also hidden through other mechanisms, including multi-threading, parallel execution, pre-fetching, efficient use of cache, and memory optimizations such as burst building, request merging, compression, smart banking, and other innovations. All these parts need to be considered when designing the GPU sub-system.
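The outstanding-request calculation described above can be sketched in Python. This is a minimal sketch under stated assumptions: it reads the definitions of OR, L, and N as OR = L / N, rounds up to a whole request, and derives N from a burst size and bus width in bytes. The function names, the rounding, and every numeric value are hypothetical and are not taken from the note.

```python
import math


def burst_length_cycles(burst_bytes, bus_width_bytes):
    """N: bus cycles needed to transfer one burst (assumed: bytes / bus width)."""
    return math.ceil(burst_bytes / bus_width_bytes)


def outstanding_requests(bus_latency, data_latency, burst_cycles):
    """OR: requests that must be in flight to cover L = Lbus + Ldata.

    Assumes OR = L / N, rounded up so that a partial request still
    counts as one FIFO entry.
    """
    total_latency = bus_latency + data_latency
    return math.ceil(total_latency / burst_cycles)


if __name__ == "__main__":
    # Illustrative values only: 64-byte bursts on a 16-byte (128-bit) bus.
    n = burst_length_cycles(burst_bytes=64, bus_width_bytes=16)
    num_or = outstanding_requests(bus_latency=100, data_latency=60, burst_cycles=n)
    print(f"N = {n} bus cycles per burst")
    print(f"OR = {num_or} outstanding requests")
```

With these placeholder numbers each burst occupies 4 bus cycles, so covering a 160-cycle total latency takes 40 requests in flight, and by the note's reasoning 40 FIFO entries. Covering the full 200-cycle budget with the same burst length would take 50.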