Anatomy of a Texture Fetch

presented for Dr. Bajaj’s graphics class
University of Texas, Austin
April 29, 2010

Mark J. Kilgard
mjk@nvidia.com

About Me

Principal System Software Engineer
NVIDIA Distinguished Inventor
Native Texan, living in Austin
But instead of UT, attended Rice (C.S. 1991)
Yet I root for Texas in football (unless playing Rice)
Interested in
Programmable shading languages for GPUs
Novel GPU-accelerated rendering paradigms
OpenGL & GPU Architecture Improvements

Our Topic: the Texture Fetch Dissected

Texture Fetch
All modern 3D games rely on texture-rich rendering
Basis for all sorts of richness and detail
More than graphics—GPU compute supports texture fetches
Conceptually
An image load with built-in re-sampling
GeForce GTX 480 capable of 42,000,000,000 fetches per second! *

* 700 MHz clock × 15 Streaming Multiprocessors × 4 fetches per clock
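
The footnote's arithmetic can be checked directly. This is only a sketch of the multiplication; reading "4 fetches per clock" as a per-multiprocessor figure is an assumption, and the constant names are made up for illustration.

```python
# Peak fetch rate implied by the footnote's three factors.
CLOCK_HZ = 700_000_000             # 700 MHz clock
STREAMING_MULTIPROCESSORS = 15
FETCHES_PER_CLOCK_PER_SM = 4       # assumed to be per multiprocessor

peak_fetches_per_second = (
    CLOCK_HZ * STREAMING_MULTIPROCESSORS * FETCHES_PER_CLOCK_PER_SM
)
print(f"{peak_fetches_per_second:,} fetches per second")
```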

A Texture Fetch Simplified

Seems pretty simple…
Given
1. An image
2. A position
Return the color of image at position

Fetch at (s,t) = (0.6, 0.25)
RGBA Result is (0.95, 0.4, 0.24, 1.0)

Texture Supplies Detail to Rendered Scenes

With texture

Without texture

Textures Make Graphics Pretty

Texture → detail,
detail → immersion,
immersion → fun

Microsoft Flight Simulator X

Shaders Combine Multiple Textures
[Figure: decal only (modulate) lightmaps only = combined scene *]

* Id Software’s Quake 2, circa 1997

Projected Texturing for Shadow Mapping

[Figure: scene without shadows, scene with shadows, light position marked]
[Figure: “what the light sees”]
Depth map from light’s point of view is re-used as a texture and
re-projected into eye’s view to generate shadows

Shadow Mapping Explained
[Figure: planar distance from light ≤ depth map projected onto scene]
Where the “less than or equals” comparison is true, the region is
“un-shadowed” (shown green)
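
The comparison on this slide can be sketched as a one-line test per fragment. This is a minimal illustration, not a full shadow-mapping implementation: the function name and arguments are hypothetical, and real implementations usually add a depth bias that the slide does not discuss.

```python
def is_unshadowed(planar_distance_from_light, depth_map_value):
    """Return True where the fragment is lit (the green region).

    planar_distance_from_light: fragment's distance from the light plane.
    depth_map_value: nearest-occluder depth read from the projected depth map.
    """
    return planar_distance_from_light <= depth_map_value


# A fragment at the stored depth is lit; one farther away is in shadow.
print(is_unshadowed(0.50, 0.50), is_unshadowed(0.70, 0.50))
```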

Texture’s Not All Fun and Games

What’s so hard?

Filtering
Poor quality results if you just return the closest color sample in the
image
Bilinear filtering + mipmapping needed
Complications
Wrap modes, formats, compression, color spaces, other dimensionalities
(1D, 3D, cube maps), etc.
Gotta be quick
Applications desire billions of fetches per second
What’s done per-fragment in the shader must be done per-texel in the
texture fetch, so 8x as much work (a tri-linear fetch reads 8 texels)!
Essentially a miniature, real-time re-sampling kernel

Anatomy of a Texture Fetch

[Diagram: texture coordinate vector → Texel Selection → texel offsets →
Texture images → texel data → Texel Combination → filtered texel vector.
Texel Selection also hands combination parameters to Texel Combination.
Texture parameters are a further input.]

Texture Fetch Functionality (1)
Texture coordinate processing
Projective texturing (OpenGL 1.0)
Cube map face selection (OpenGL 1.3)
Texture array indexing (EXT_texture_array, OpenGL 3.0)
Coordinate scale: normalization (ARB_texture_rectangle)
Level-of-detail (LOD) computation
Log of maximum texture coordinate partial derivative (OpenGL 1.0)
LOD clamping (OpenGL 1.2)
LOD bias (OpenGL 1.4)
Anisotropic scaling of partial derivatives (SGIX_texture_lod_bias)
Wrap modes
Repeat, clamp (OpenGL 1.0)
Clamp to edge (OpenGL 1.2), Clamp to border (OpenGL 1.3)
Mirrored repeat (OpenGL 1.4)
Fully generalized clamped mirror repeat (EXT_texture_mirror_clamp)
Wrap to adjacent cube map face
Region clamp & mirror (PlayStation 2)
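
Three of the wrap modes listed above can be sketched on integer texel indices. This is an illustrative sketch only: the function names are hypothetical, and the mirrored-repeat formula follows my reading of the OpenGL behavior rather than anything stated on the slide.

```python
def wrap_repeat(i, size):
    """Repeat: the texture tiles endlessly."""
    return i % size


def wrap_clamp_to_edge(i, size):
    """Clamp to edge: out-of-range indices stick to the nearest edge texel."""
    return max(0, min(size - 1, i))


def wrap_mirrored_repeat(i, size):
    """Mirrored repeat: every other tile is flipped."""
    f = (i % (2 * size)) - size
    mirrored = f if f >= 0 else -(1 + f)
    return (size - 1) - mirrored


indices = [-1, 0, 3, 4, 5]
print([wrap_repeat(i, 4) for i in indices])
print([wrap_clamp_to_edge(i, 4) for i in indices])
print([wrap_mirrored_repeat(i, 4) for i in indices])
```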

Arrays of 2D Textures

Multiple skins packed in texture array
Motivation: binding to one multi-skin texture array avoids
texture bind per object

[Figure: grid of skins, indexed across by texture array index (0 to 4)
and down by mipmap level index (0 to 4)]

Filter modes
Minification / magnification transition (OpenGL 1.0)
Nearest, linear, mipmap (OpenGL 1.0)
1D & 2D (OpenGL 1.0), 3D (OpenGL 1.2), 4D (SGIS_texture4D)
Anisotropic (EXT_texture_filter_anisotropic)
Fixed-weights: Quincunx, 3x3 Gaussian
Used for multi-sample resolves
Detail texture magnification (SGIS_detail_texture)
Sharpen texture magnification (SGIS_sharpen_texture)
4x4 filter (SGIS_texture_filter4)
Sharp-edge texture magnification (E&S Harmony)
Floating-point texture filtering (ARB_texture_float, OpenGL 3.0)


Texture formats
Uncompressed
Packing: RGBA8, RGB5A1, etc. (OpenGL 1.1)
Type: unsigned, signed (NV_texture_shader)
Normalized: fixed-point vs. integer (OpenGL 3.0)
Compressed
DXT compression formats (EXT_texture_compression_s3tc)
4:2:2 video compression (various extensions)
1- and 2-component compression (EXT_texture_compression_latc,
OpenGL 3.0)
Other approaches: IDCT, VQ, differential encoding, normal maps, separable
decompositions
Alternate encodings
RGB9 with 5-bit shared exponent (EXT_texture_shared_exponent)
Spherical harmonics
Sum of product decompositions


Pre-filtering operations
Gamma correction (OpenGL 2.1)
Table: sRGB / arbitrary
Shadow map comparison (OpenGL 1.4)
Compare functions: LEQUAL, GREATER, etc. (OpenGL 1.5)
Needs “R” depth value per texel
Palette lookup (EXT_paletted_texture)
Thresholding
Color key
Generalized thresholding

Delicate Color Fidelity with sRGB
Problem: PC display devices have a legacy non-linear (sRGB)
display gamut—delicate color shading looks wrong
Conventional Gamma correct
rendering (sRGB rendered)
(uncorrected
color)
Softer and more natural

Unnaturally
Implication for texture
deep facial
hardware: Should perform
shadows
sRGB-to-linear color space
convert per-texel, so 24
scalar conversions—or
more—per fetch
NVIDIA’s Adriana GeForce 8 Launch Demo
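
To make the "24 scalar conversions" concrete, here is a minimal sketch of per-texel sRGB decoding using the standard piecewise sRGB transfer function. It assumes a tri-linear fetch of 8 texels with 3 color components each (alpha is left linear); the function names are hypothetical, and hardware would more likely use a lookup table than evaluate the power function.

```python
def srgb_to_linear(c):
    """Decode one sRGB-encoded component in [0, 1] to linear light."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def decode_texel(rgba):
    """Convert R, G, B to linear; alpha is passed through unchanged."""
    r, g, b, a = rgba
    return (srgb_to_linear(r), srgb_to_linear(g), srgb_to_linear(b), a)


# A tri-linear fetch touches 8 texels, so 8 x 3 = 24 scalar conversions.
texels = [(0.5, 0.5, 0.5, 1.0)] * 8
linear_texels = [decode_texel(t) for t in texels]
scalar_conversions = len(texels) * 3
print(scalar_conversions, linear_texels[0])
```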


Optimizations
Level-of-detail weighting adjustments
Mid-maps (extra pre-filtered levels in-between existing levels)
Unconventional uses
Bitmap textures for fonts with large filters (Direct3D 10)
Rip-mapping
Non-uniform texture border color
Clip-mapping (SGIX_clipmap)
Multi-texel borders
Silhouette maps (Pradeep Sen’s work)
Shadow mapping
Sharp piecewise linear magnification

Phased Data Flow
Must hide long memory read latency between Selection
and Combination phases
[Diagram: texture coordinate vector → Texel Selection → texel offsets →
memory reads for samples (Texture images) → texel data →
Texel Combination → filtered texel vector. Combination parameters are
FIFOed from Selection to Combination. Texture parameters are a further
input.]

What really happens?

Let’s consider a simple tri-linear mip-mapped 2D
projective texture fetch
Logically just one instruction
High-level language statement (Cg/HLSL)
float4 color = tex2Dproj(decalSampler, st);
Assembly instruction (NV_fragment_program)
TXP o[COLR], f[TEX3], TEX2, 2D;
Logically
Texel selection
Texel combination
How many operations are involved?

Medium-Level Dissection
of a Texture Fetch
[Diagram: interpolated texture coords vector → convert texture coords
to texel coords (floating-point) → floor / frac into integer coords and
fractional weights (integer / fixed-point) → convert texel coords to
texel offsets → texture images → texel data → texel combination
(integer / fixed-point texel intermediates) → floating-point scaling
and combination → filtered texel vector. Combination parameters pass
alongside; texture parameters are a further input.]

Interpolation

First we need to interpolate (s,t,r,q)
This is the f[TEX3] part of the TXP instruction
Projective texturing means we want (s/q, t/q)
And possibly r/q if shadow mapping
In order to correct for perspective, hardware actually interpolates
(s/w, t/w, r/w, q/w)
If not projective texturing, could linearly interpolate inverse w (or 1/w)
Then compute its reciprocal to get w
Since 1/(1/w) equals w
Then multiply (s/w,t/w,r/w,q/w) times w
To get (s,t,r,q)
If projective texturing, we can instead
Compute reciprocal of q/w to get w/q
Then multiply (s/w,t/w,r/w) by w/q to get (s/q, t/q, r/q)
Observe projective texturing is same cost as perspective correction

Interpolation Operations

Ax + By + C per scalar linear interpolation
2 MADs
One reciprocal to invert q/w for projective texturing
Or one reciprocal to invert 1/w for perspective texturing
Then 1 MUL per component for s/w * w/q
Or s/w * w
For (s,t) means
4 MADs, 2 MULs, & 1 RCP
(s,t,r) requires 6 MADs, 3 MULs, & 1 RCP
All floating-point operations
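
The interpolation steps can be sketched as follows. This is a minimal illustration under stated assumptions: the plane coefficients (A, B, C) per attribute are taken as given by triangle setup, and all function names are hypothetical rather than anything from the slides.

```python
def interpolate_plane(a, b, c, x, y):
    """Ax + By + C: one linearly interpolated scalar (2 MADs)."""
    return a * x + b * y + c


def projective_coords(s_over_w, t_over_w, r_over_w, q_over_w):
    """Projective texturing: 1 RCP of q/w, then 1 MUL per component."""
    w_over_q = 1.0 / q_over_w
    return (s_over_w * w_over_q, t_over_w * w_over_q, r_over_w * w_over_q)


def perspective_coords(s_over_w, t_over_w, r_over_w, one_over_w):
    """Non-projective texturing: 1 RCP of 1/w, then 1 MUL per component."""
    w = 1.0 / one_over_w
    return (s_over_w * w, t_over_w * w, r_over_w * w)


# Example: (s, t, r, q) = (1.2, 0.5, 0.8, 2.0) at a pixel where w = 4.
s, t, r, q, w = 1.2, 0.5, 0.8, 2.0, 4.0
print(projective_coords(s / w, t / w, r / w, q / w))   # close to (0.6, 0.25, 0.4)
print(perspective_coords(s / w, t / w, r / w, 1 / w))  # close to (1.2, 0.5, 0.8)
```

Both paths use one reciprocal and one multiply per component, which is the "same cost" observation in code form.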

Texture Space Mapping

Have interpolated & projected coordinates
Now need to determine what texels to fetch

Multiply (s,t) by (width,height) of texture base level
Could convert (s,t) to fixed-point first
Or do math in floating-point
Say base texture is 256x256
So compute (u,v) = (s*256, t*256)

Mipmap Level-of-detail Selection

Tri-linear mip-mapping means compute
appropriate mipmap level
Hardware rasterizes in 2x2 pixel entities
Typically called quad-pixels or just quad
Finite difference with neighbors (one-pixel separation) to get
change in u and v with respect to window space
Approximation to ∂u/∂x, ∂u/∂y, ∂v/∂x, ∂v/∂y
Means 4 subtractions per quad (1 per pixel)
Now compute approximation to gradient
length
p = max(sqrt((∂u/∂x)^2 + (∂u/∂y)^2),
        sqrt((∂v/∂x)^2 + (∂v/∂y)^2))
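
A sketch of the finite-difference step for one 2x2 quad, using the grouping of derivatives exactly as written in the formula above. The `[row][column]` quad layout and the function name are assumptions made for illustration; differences are taken relative to the top-left pixel only.

```python
import math


def gradient_length(u_quad, v_quad):
    """Approximate p for a 2x2 quad of texel-space (u, v) values.

    u_quad and v_quad are indexed [row][column], so moving one column is
    a one-pixel step in x and moving one row is a one-pixel step in y.
    """
    du_dx = u_quad[0][1] - u_quad[0][0]
    du_dy = u_quad[1][0] - u_quad[0][0]
    dv_dx = v_quad[0][1] - v_quad[0][0]
    dv_dy = v_quad[1][0] - v_quad[0][0]
    return max(math.hypot(du_dx, du_dy), math.hypot(dv_dx, dv_dy))


# u changes by 4 texels per pixel in x, v by 3 texels per pixel in y.
p = gradient_length([[0.0, 4.0], [0.0, 4.0]], [[0.0, 0.0], [3.0, 3.0]])
print(p)
```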

Level-of-detail Bias and Clamping

Convert p length to power-of-two level-of-detail and
apply LOD bias
λ = log2(p) + lodBias
Now clamp λ to valid LOD range
λ’ = max(minLOD, min(maxLOD, λ))
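
The bias and clamp can be sketched in a few lines. The function and parameter names are hypothetical, and the sketch assumes p > 0 (a zero gradient would need special handling that the slide does not cover).

```python
import math


def level_of_detail(p, lod_bias, min_lod, max_lod):
    """Return the clamped LOD for gradient length p (p must be > 0)."""
    lod = math.log2(p) + lod_bias
    return max(min_lod, min(max_lod, lod))


print(level_of_detail(4.0, 0.5, 0.0, 8.0))     # log2(4) + 0.5
print(level_of_detail(0.25, 0.0, 0.0, 8.0))    # magnification clamps to min_lod
print(level_of_detail(4096.0, 0.0, 0.0, 8.0))  # clamps to max_lod
```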

Determine Mipmap Levels and
Level Filtering Weight

Determine lower and upper mipmap levels
b = floor(λ’) is bottom mipmap level
t = floor(λ’+1) is top mipmap level
Determine filter weight between levels
w = frac(λ’) is filter weight
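
A minimal sketch of splitting the clamped LOD into two levels and a blend weight. The function name is hypothetical. Note that when the LOD is a whole number the weight is 0, so the top level contributes nothing to the result.

```python
import math


def mip_levels_and_weight(lod):
    """Split a clamped LOD into (bottom level, top level, blend weight)."""
    bottom = math.floor(lod)
    top = math.floor(lod + 1)
    weight = lod - bottom  # frac(lod) for a non-negative LOD
    return bottom, top, weight


print(mip_levels_and_weight(2.5))
print(mip_levels_and_weight(3.0))
```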

Determine Texture Sample Point

Get (u,v) for selected top and bottom mipmap levels
Consider a level l which could be either level t or b
With (u,v) locations (u_l, v_l)
Perform GL_CLAMP_TO_EDGE wrap modes
(here u and v are normalized to [0,1], so the clamp bounds sit half a
texel in from each edge)
uw = max(1/(2*widthOfLevel(l)),
         min(1-1/(2*widthOfLevel(l)), u))
vw = max(1/(2*heightOfLevel(l)),
         min(1-1/(2*heightOfLevel(l)), v))
Get integer location (i,j) within each level
(i,j) = ( floor(uw*widthOfLevel(l)),
          floor(vw*heightOfLevel(l)) )
[Figure: texture in (s,t) space with its edge and border marked]

Determine Texel Locations

Bilinear sample needs 4 texel locations
(i0,j0), (i0,j1), (i1,j0), (i1,j1)
With integer texel coordinates
i0 = floor(i-1/2)
i1 = floor(i+1/2)
j0 = floor(j-1/2)
j1 = floor(j+1/2)
Also compute fractional weights for bilinear filtering
a = frac(i-1/2)
b = frac(j-1/2)
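
The sample-point and texel-location steps can be sketched together for one mipmap level. Several assumptions are made here and are not from the slides: the input coordinate is normalized to [0,1]; the texel-space location keeps its fractional part (if it were truncated to an integer first, frac(i-1/2) would always be 1/2); and the upper index is clamped to the last texel so it stays in range at the edge. Function names are hypothetical.

```python
import math


def clamp_to_edge(coord, size):
    """Clamp a normalized coordinate to half a texel inside each edge."""
    half_texel = 1.0 / (2 * size)
    return max(half_texel, min(1.0 - half_texel, coord))


def bilinear_footprint(s, t, width, height):
    """Return (i0, i1, j0, j1, a, b) for one mipmap level."""
    i = clamp_to_edge(s, width) * width    # texel-space location, fraction kept
    j = clamp_to_edge(t, height) * height
    i0 = math.floor(i - 0.5)
    j0 = math.floor(j - 0.5)
    # Assumption: keep the upper neighbor in range; its weight is 0 there.
    i1 = min(math.floor(i + 0.5), width - 1)
    j1 = min(math.floor(j + 0.5), height - 1)
    a = (i - 0.5) - i0                     # frac(i - 1/2)
    b = (j - 0.5) - j0                     # frac(j - 1/2)
    return i0, i1, j0, j1, a, b


# The (s,t) = (0.6, 0.25) fetch from earlier, on a 256x256 level.
print(bilinear_footprint(0.6, 0.25, 256, 256))
```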

Determine Texel Addresses

Assuming a texture level image’s base pointer, compute a texel
address of each texel to fetch
Assume bytesPerTexel = 4 bytes for RGBA8 texture
Example
addr00 = baseOfLevel(l) +
bytesPerTexel*(i0+j0*widthOfLevel(l))
More complicated address schemes are needed for good texture
locality!
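
A sketch of the simple linear addressing from the example, producing all four addresses for one level. The base address value and the function names are made up for illustration; as the slide notes, real hardware uses more elaborate layouts for locality.

```python
BYTES_PER_TEXEL = 4  # RGBA8


def texel_address(base_of_level, width_of_level, i, j):
    """Linear (row-major) address of texel (i, j) within one level."""
    return base_of_level + BYTES_PER_TEXEL * (i + j * width_of_level)


def footprint_addresses(base_of_level, width_of_level, i0, i1, j0, j1):
    """Return (addr00, addr01, addr10, addr11) for a bilinear footprint."""
    return (
        texel_address(base_of_level, width_of_level, i0, j0),
        texel_address(base_of_level, width_of_level, i0, j1),
        texel_address(base_of_level, width_of_level, i1, j0),
        texel_address(base_of_level, width_of_level, i1, j1),
    )


# Hypothetical base address 0x1000 for a 256-texel-wide level.
print(footprint_addresses(0x1000, 256, 153, 154, 63, 64))
```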

Initiate Texture Reads

Initiate texture memory reads at the 8 texel addresses
addr00, addr01, addr10, addr11 for the upper level
addr00, addr01, addr10, addr11 for the lower level
Queue the weights a, b, and w
Latency FIFO in hardware makes these weights available
when texture reads complete
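
The role of the latency FIFO can be illustrated with a small software model. This is only a conceptual sketch, assuming reads complete in issue order; the class and method names are hypothetical and a dictionary stands in for texture memory.

```python
from collections import deque


class LatencyFifo:
    """Holds filter weights while the matching texel reads are in flight."""

    def __init__(self):
        self._pending = deque()

    def issue(self, addresses, weights):
        """Selection phase: start the reads and queue the weights (a, b, w)."""
        self._pending.append((tuple(addresses), weights))

    def complete(self, memory):
        """Combination phase: oldest reads return, paired with their weights."""
        addresses, weights = self._pending.popleft()
        texels = [memory[address] for address in addresses]
        return texels, weights


memory = {0: 0.25, 4: 0.75}
fifo = LatencyFifo()
fifo.issue([0], (0.1, 0.2, 0.3))
fifo.issue([4], (0.4, 0.5, 0.6))
print(fifo.complete(memory))
print(fifo.complete(memory))
```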

Texel Combination
When texel reads are returned, begin filtering
Assume results are
Top texels: t00, t01, t10, t11
Bottom texels: b00, b01, b10, b11
Per-component filtering math is tri-linear filter
RGBA8 is four components
result = (1-a)*(1-b)*(1-w)*b00 +
         (1-a)*b*(1-w)*b01 +
         a*(1-b)*(1-w)*b10 +
         a*b*(1-w)*b11 +
         (1-a)*(1-b)*w*t00 +
         (1-a)*b*w*t01 +
         a*(1-b)*w*t10 +
         a*b*w*t11;
24 MADs per component, or 96 for RGBA
Lerp-tree could do 14 MADs per component, or 56 for RGBA
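
Both forms of the filter can be sketched for one component and checked against each other. The function names are hypothetical. The lerp tree uses 7 lerps (3 per level plus 1 between levels), which matches the 14-MAD figure if each lerp is counted as 2 MADs.

```python
def trilinear_direct(bottom, top, a, b, w):
    """Eight-term weighted sum; bottom/top are (x00, x01, x10, x11)."""
    b00, b01, b10, b11 = bottom
    t00, t01, t10, t11 = top
    return ((1 - a) * (1 - b) * (1 - w) * b00 +
            (1 - a) * b * (1 - w) * b01 +
            a * (1 - b) * (1 - w) * b10 +
            a * b * (1 - w) * b11 +
            (1 - a) * (1 - b) * w * t00 +
            (1 - a) * b * w * t01 +
            a * (1 - b) * w * t10 +
            a * b * w * t11)


def lerp(x, y, weight):
    """Linear interpolation from x (weight 0) to y (weight 1)."""
    return x + weight * (y - x)


def bilinear(texels, a, b):
    """Three lerps: two along i with weight a, one along j with weight b."""
    x00, x01, x10, x11 = texels
    return lerp(lerp(x00, x10, a), lerp(x01, x11, a), b)


def trilinear_lerp_tree(bottom, top, a, b, w):
    """Seven lerps: bilinear per level, then blend levels with weight w."""
    return lerp(bilinear(bottom, a, b), bilinear(top, a, b), w)


bottom = (0.0, 1.0, 2.0, 3.0)
top = (4.0, 5.0, 6.0, 7.0)
print(trilinear_direct(bottom, top, 0.25, 0.5, 0.75))
print(trilinear_lerp_tree(bottom, top, 0.25, 0.5, 0.75))
```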

Total Texture Fetch Operations

Assuming a fixed-point RGBA tri-linear mipmap filtered
projective texture fetch
Interpolation
6 MADs, 3 MULs, & 1 RCP (floating-point)
Texel selection
Texture space mapping
2 MULs (fixed-point)
LOD determination (floating-point)
1 pixel difference, 2 SQRTs, 4 MULs, 1 LOG2
LOD bias and clamping (fixed-point)
1 ADD, 1 MIN, 1 MAX
Level determination and level weighting (fixed-point)
1 FLOOR, 1 ADD, 1 FRAC
Texture sample point
4 MAXs, 4 MINs, 2 FLOORs (fixed-point)
Texel locations and bi-linear weights
8 FLOORs, 4 FRACs, 8 ADDs (fixed-point)
Addressing
16 integer MADs (integer)
Texel combination
56 fixed-point MADs (fixed-point)
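
Adding up the counts listed on this slide gives a rough total for the one logical instruction. This is a simple tally, counting the "1 pixel difference" as a single operation and treating every operation as equal in cost, which is a simplification; the dictionary and key names are made up for illustration.

```python
OPERATION_COUNTS = {
    "interpolation": {"MAD": 6, "MUL": 3, "RCP": 1},
    "texture space mapping": {"MUL": 2},
    "LOD determination": {"DIFF": 1, "SQRT": 2, "MUL": 4, "LOG2": 1},
    "LOD bias and clamping": {"ADD": 1, "MIN": 1, "MAX": 1},
    "level determination and weighting": {"FLOOR": 1, "ADD": 1, "FRAC": 1},
    "texture sample point": {"MAX": 4, "MIN": 4, "FLOOR": 2},
    "texel locations and bi-linear weights": {"FLOOR": 8, "FRAC": 4, "ADD": 8},
    "addressing": {"MAD": 16},
    "texel combination": {"MAD": 56},
}

# Sum the operations in each stage, then across all stages.
per_stage = {stage: sum(ops.values()) for stage, ops in OPERATION_COUNTS.items()}
total_operations = sum(per_stage.values())

for stage, count in per_stage.items():
    print(f"{stage}: {count}")
print(f"total: {total_operations}")
```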

Observations about the Texture Fetch

Lots of ways to implement the math
Lots of clever ways to be efficient
Lots more texture operations not considered in this analysis
Compression
Anisotropic filtering
sRGB
Shadow mapping
Arguably TEX instructions are “world’s most CISC instructions”
Texture fetches are incredibly complex instructions
Good deal of GPU’s superiority at graphics operations over CPUs
is attributable to TEX instruction efficiency
Good for compute too

Take Away Information

The GPU texture fetch is about two orders of magnitude
more complex than the most complex CPU instruction
And texture fetches are extremely common
Tens of billions of texture fetches per second are expected by modern
GPU applications
Not just a graphics thing
Using CUDA, you can access textures from within your
compute- and bandwidth-intensive parallel kernels

GPU Technology Conference 2010
Monday, Sept. 20 - Thurs., Sept. 23, 2010
San Jose Convention Center, San Jose, California

The most important event in the GPU ecosystem
Learn about seismic shifts in GPU computing
Preview disruptive technologies and emerging applications
Get tools and techniques to impact mission critical projects
Network with experts, colleagues, and peers across industries

Opportunities
Call for Submissions
Sessions & posters
Sponsors / Exhibitors
Reach decision makers
“CEO on Stage”
Showcase for Startups
Tell your story to VCs and analysts

“I consider the GPU Technology Conference to be the single best place to see the amazing
work enabled by the GPU. It’s a great venue for meeting researchers, developers, scientists,
and entrepreneurs from around the world.”

-- Professor Hanspeter Pfister, Harvard University and GTC 2009 keynote speaker
