GPU Latency Analysis
Technical Note
For Vivante GC Cores
Latency and Outstanding Request Analysis
Version 1.0
January 2012
Revision History
Version | Date             | Author     | Description
1.0     | January 16, 2012 | Benson Tao | Initial release
1 General System Performance Analysis
Q: Can you please provide additional details on load balancing and recommendations on optimizing CPU and GPU
interactions for peak system performance?
In a graphics subsystem, an application such as a 3D game or rich GUI accesses the graphics hardware by calling APIs through the operating system. When the application requests an image to be rendered onscreen, the API calls the OS, which in turn invokes the GPU driver to communicate with the GPU hardware and draw the image. From the CPU's perspective, the CPU accumulates and sets up graphics commands that are dispatched to the GPU for processing and display rendering. As graphics performance and screen/object detail increase with advances in technology, an optimized method for CPU-GPU communication must be maintained to ensure external/internal bandwidth availability, internal communication with minimal latency, cache coherence, optimizations at the system, processor, and GPU levels, and correct access priorities for the different system blocks and the GPU/CPU (for example, starving the display controller of access to GPU screen data will cause display flickering and a negative user experience).
In general, since each SoC design has specific requirements, there needs to be a balance between all resources: interconnects (AXI/AHB, OCP, NoC, proprietary, etc.), memory interfaces, OS/system-specific optimizations, chip floor planning, graphics compression technologies, and effective use of the memory hierarchy (registers, caches, system memories, efficient banking, coherency, single/dual-port RAMs, access speeds, etc.). Since each design is customized, we provide test vectors and performance traces for the graphics subsystem that can be included in a customer's full-chip simulation to test overall system loading. We also work with the customer directly, based on application type/usage/requirements and an analysis of CPU and system resources, to recommend the best graphics core (and any optimizations/derivative cores) based on power, size, performance, and features.
2 Latency Analysis
SoC architectures consist of multiple processing engines (CPU, 3D, video, etc.), memory units (DDR, caches, etc.), and I/O blocks (network, USB, HDD, Flash, wireless, RF, etc.). Together, these functional blocks define the product, its target market, performance, features, and the key differentiators that sell the chip and the end device. Each processing engine (CPU, video, 3D graphics) has a different data-traffic profile, which affects its bandwidth, latency, and performance requirements. A well-defined SoC interconnect scheme capable of handling multiple processing units in parallel with minimal or no performance degradation is ideal in today's multi-tasking environment (e.g., video + composition, or video + 3D graphics + composition). In addition to an efficient interconnect (AXI, ACE-Lite, NoC, etc.), a well-architected memory controller subsystem is required to match the bandwidth, latency, and data-transfer requirements of each processing unit. A designer does not want a high-speed interconnect coupled with a low-speed memory controller (MC) and memory bus (in speed or width), since the MC will become the bottleneck.
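The bottleneck argument above can be made concrete with a minimal sketch: the sustained bandwidth of a data path is capped by its slowest link, so a fast interconnect paired with a slow memory controller wastes the interconnect's headroom. The figures below are purely illustrative, not from any specific design.

```python
def path_bandwidth_gbps(links):
    """Sustained bandwidth of a data path is the minimum of its links (GB/s)."""
    return min(links.values())

# Illustrative link speeds for a hypothetical design:
path = {
    "interconnect": 12.8,       # e.g., 128-bit AXI at 800 MHz = 12.8 GB/s
    "memory_controller": 6.4,   # e.g., 32-bit DDR at 1600 MT/s = 6.4 GB/s
}

print(path_bandwidth_gbps(path))  # 6.4 -- the MC caps the whole path
```

In this example, half of the interconnect's bandwidth can never be used; balancing the two links would free area or power for other blocks.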
As system complexity increases, overall performance is determined by how well the engine speed, interconnect design, and memory system combine to provide sustained data bandwidth to each engine while also meeting the real-time goals of latency-sensitive traffic. To design an optimized system, every part of the data transfer from initiator to destination must be analyzed in totality. The following paragraphs focus on latency considerations and on the number of outstanding requests.
Latency in a GPU design comprises interconnect (bus) latency and data latency. Bus latency is the total roundtrip latency from the GPU (initiator) through the interconnect to external DDR memory and back to the GPU, as depicted below:
The total bus latency is the sum of all latencies incurred as a request or data transfer travels from the GPU and back. In the image above, the total latency is the sum of stages one through six, which covers the interconnect fabric, the memory controller, and the DDR memories. The latency of each component is system dependent, and a minimal latency (in GPU clock cycles) is required for optimal performance.
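As a sketch of the summation described above, the roundtrip can be modeled as per-stage latencies added end to end. The six stage names and cycle counts below are hypothetical placeholders for the stages in the figure, not measured values.

```python
# Hypothetical per-stage latencies (in GPU clock cycles) for the roundtrip
# GPU -> interconnect -> memory controller -> DDR -> back to the GPU.
stage_cycles = {
    "gpu_to_interconnect": 10,
    "interconnect_to_mc": 20,
    "mc_to_ddr": 30,
    "ddr_access": 40,
    "ddr_to_mc_return": 30,
    "mc_to_gpu_return": 30,
}

# Total bus latency is simply the sum of every stage on the path.
total_bus_latency = sum(stage_cycles.values())
print(total_bus_latency)  # 160 cycles in this example
```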
The second part is data latency, which depends on the priority assigned to each functional unit. This delay includes the number of GPU cycles needed to receive data after arbitration. For example, the display controller can be assigned the highest priority since it must refresh the screen on a fixed schedule. If the display controller is not given high priority while all devices are requesting access to the interconnect or to data in external DDR memory, it may be starved of data, causing a refresh glitch or an incorrect display update. Such a glitch has a strong negative user impact because it is easily visible. Engine priorities must therefore be balanced and analyzed to ensure acceptable latency and sufficient bandwidth during peak data access.
To overcome bus and data latency, the number of outstanding requests must be defined correctly so that the GPU is not starved while waiting for data to return from system memory. It is determined from the total latency, with the goal of keeping the GPU latency under 200 cycles. The general formula for the number of outstanding requests is:
OR = L / N
OR = Number of Outstanding Requests
L = Latency in Bus Cycles = Lbus (Bus Latency) + Ldata (Data Latency)
N = Burst Length in Bus Cycles, Dependent on Bus Width and Burst Size (Bytes)
The number of outstanding requests in the GPU also determines the amount of FIFO storage required in the GPU, since each FIFO entry can be considered one outstanding transaction.
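The formula can be worked through with a small example. The numbers below are illustrative (a 160-cycle bus latency, a 40-cycle arbitration delay, and 8-cycle bursts, e.g., a 128-byte burst on a 128-bit bus); rounding up ensures enough requests are in flight to cover the full roundtrip.

```python
import math

def outstanding_requests(bus_latency, data_latency, burst_cycles):
    """OR = L / N, rounded up, where L = Lbus + Ldata and N is the burst
    length in bus cycles."""
    total_latency = bus_latency + data_latency   # L = Lbus + Ldata
    return math.ceil(total_latency / burst_cycles)

# L = 160 + 40 = 200 cycles (at the 200-cycle target), N = 8 cycles/burst:
print(outstanding_requests(160, 40, 8))  # 25 -> the FIFO needs 25 entries
```

With these numbers, fewer than 25 in-flight requests would leave the GPU idle for part of every roundtrip, while more would only add FIFO area without hiding additional latency.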
In the Vivante GC Core design, latency is also hidden through other mechanisms, including multi-threading, parallel execution, prefetching, efficient use of caches, and memory optimizations such as burst building, request merging, compression, and smart banking. All of these parts need to be considered when designing the GPU subsystem.