Anatomy of a Texture Fetch
1. Presented for Dr. Bajaj’s graphics class
University of Texas, Austin
April 29, 2010
Mark J. Kilgard
mjk@nvidia.com
2. About Me
Principal System Software Engineer
NVIDIA Distinguished Inventor
Native Texan, living in Austin
But instead of UT, attended Rice (C.S. 1991)
Yet I root for Texas in football (unless playing Rice)
Interested in
Programmable shading languages for GPUs
Novel GPU-accelerated rendering paradigms
OpenGL & GPU Architecture Improvements
3. Our Topic: the Texture Fetch Dissected
Texture Fetch
All modern 3D games rely on texture-rich rendering
Basis for all sorts of richness and detail
More than graphics—GPU compute supports texture fetches
Conceptually
An image load with built-in re-sampling
GeForce 480 capable of 42,000,000,000 per second! *
* 700 MHz clock × 15 Streaming Multiprocessors × 4 fetches per clock
4. A Texture Fetch Simplified
Seems pretty simple…
Given
1. An image
2. A position
Return the color of image at position
Fetch at (s,t) = (0.6, 0.25)
RGBA result is (0.95, 0.4, 0.24, 1.0)
6. Textures Make Graphics Pretty
Texture → detail,
detail → immersion,
immersion → fun
Microsoft Flight Simulator X
7. Shaders Combine Multiple Textures
decal only × (modulate) lightmaps only = combined scene *
* Id Software’s Quake 2, circa 1997
8. Projected Texturing for Shadow Mapping
[Panels: without shadows; with shadows; light position; “what the light sees”]
Depth map from light’s point of view is re-used as a texture and re-projected into eye’s view to generate shadows
9. Shadow Mapping Explained
Planar distance from light ≤ (less than) depth map projected onto scene = (equals) true “un-shadowed” region, shown green
11. What’s so hard?
Filtering
Poor quality results if you just return the closest color sample in the
image
Bilinear filtering + mipmapping needed
Complications
Wrap modes, formats, compression, color spaces, other dimensionalities
(1D, 3D, cube maps), etc.
Gotta be quick
Applications desire billions of fetches per second
What’s done per-fragment in the shader must be done per-texel in the texture fetch, so 8x as much work (8 texels per tri-linear fetch)!
Essentially a miniature, real-time re-sampling kernel
12. Anatomy of a Texture Fetch
[Diagram: texture coordinate vector → Texel Selection → texel offsets → texture images → texel data → Texel Combination → filtered texel vector. Texel Selection also passes combination parameters to Texel Combination; texture parameters feed both stages.]
18. Delicate Color Fidelity with sRGB
Problem: PC display devices have a legacy non-linear (sRGB) display gamut, so delicate color shading looks wrong
Conventional rendering (uncorrected color): unnaturally deep facial shadows
Gamma correct (sRGB rendered): softer and more natural
Implication for texture hardware: should perform sRGB-to-linear color space convert per-texel, so 24 scalar conversions, or more, per fetch
NVIDIA’s Adriana GeForce 8 Launch Demo
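A minimal sketch of the per-texel conversion, assuming the standard sRGB piecewise transfer function; the function names are illustrative and not from the slides. The count of 24 is read here as 8 texels of a tri-linear fetch times 3 color components, with alpha left untouched because it is stored linearly.

```python
def srgb_to_linear(c):
    """Decode one sRGB-encoded component in [0, 1] to linear light."""
    if c <= 0.04045:
        return c / 12.92                    # linear toe segment
    return ((c + 0.055) / 1.055) ** 2.4     # power-law segment

def decode_texel(rgba):
    """Decode R, G and B; alpha is already linear and passes through."""
    r, g, b, a = rgba
    return (srgb_to_linear(r), srgb_to_linear(g), srgb_to_linear(b), a)

# Assumed accounting: 8 texels per tri-linear fetch, 3 encoded components each.
conversions_per_fetch = 8 * 3
```

Hardware would more plausibly use a lookup table than a power function; the arithmetic form is shown only because it is easy to check.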
19. Texture Fetch Functionality (5)
Optimizations
Level-of-detail weighting adjustments
Mid-maps (extra pre-filtered levels in-between existing levels)
Unconventional uses
Bitmap textures for fonts with large filters (Direct3D 10)
Rip-mapping
Non-uniform texture border color
Clip-mapping (SGIX_clipmap)
Multi-texel borders
Silhouette maps (Pradeep Sen’s work)
Shadow mapping
Sharp piecewise linear magnification
20. Phased Data Flow
Must hide long memory read latency between Selection and Combination phases
[Diagram: texture coordinate vector → Texel Selection → texel offsets → memory reads for samples (texture images) → texel data → Texel Combination → filtered texel vector. Combination parameters are FIFOed from Selection to Combination; texture parameters feed both stages.]
21. What really happens?
Let’s consider a simple tri-linear mip-mapped 2D projective texture fetch
Logically just one instruction
High-level language statement (Cg/HLSL)
float4 color = tex2Dproj(decalSampler, st);
Assembly instruction (NV_fragment_program)
TXP o[COLR], f[TEX3], TEX2, 2D;
Logically
Texel selection
Texel combination
How many operations are involved?
22. Medium-Level Dissection
of a Texture Fetch
[Diagram: interpolated texture coords vector → convert texture coords to texel coords → floor / frac → integer coords & fractional weights → convert texel coords to texel offsets → texture images → texel data → integer / fixed-point texel combination → texel intermediates → floating-point scaling and combination → filtered texel vector. Texture parameters and combination parameters feed the stages.]
23. Interpolation
First we need to interpolate (s,t,r,q)
This is the f[TEX3] part of the TXP instruction
Projective texturing means we want (s/q, t/q)
And possibly r/q if shadow mapping
In order to correct for perspective, hardware actually interpolates (s/w, t/w, r/w, q/w)
If not projective texturing, could linearly interpolate inverse w (or 1/w)
Then compute its reciprocal to get w
Since 1/(1/w) equals w
Then multiply (s/w,t/w,r/w,q/w) times w
To get (s,t,r,q)
If projective texturing, we can instead
Compute reciprocal of q/w to get w/q
Then multiply (s/w,t/w,r/w) by w/q to get (s/q, t/q, r/q)
Observe projective texturing is same cost as perspective correction
24. Interpolation Operations
Ax + By + C per scalar linear interpolation
2 MADs
One reciprocal to invert q/w for projective texturing
Or one reciprocal to invert 1/w for perspective texturing
Then 1 MUL per component for s/w * w/q
Or s/w * w
For (s,t) means
4 MADs, 2 MULs, & 1 RCP
(s,t,r) requires 6 MADs, 3 MULs, & 1 RCP
All floating-point operations
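The operation counts above can be sketched as follows. This is an illustration under assumptions, not the hardware's actual datapath: the plane coefficients (A, B, C) and the function names are hypothetical.

```python
def plane(coeffs, x, y):
    """Evaluate A*x + B*y + C at window position (x, y): two multiply-adds."""
    A, B, C = coeffs
    return A * x + B * y + C

def perspective_st(s_w, t_w, inv_w):
    """Non-projective case: 1 RCP of interpolated 1/w, then 1 MUL per component."""
    w = 1.0 / inv_w
    return s_w * w, t_w * w

def projective_st(s_w, t_w, q_w):
    """Projective case: 1 RCP of interpolated q/w, then 1 MUL per component."""
    w_over_q = 1.0 / q_w
    return s_w * w_over_q, t_w * w_over_q
```

Both paths use one reciprocal and one multiply per component, which is why projective texturing costs the same as ordinary perspective correction.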
25. Texture Space Mapping
Have interpolated & projected coordinates
Now need to determine what texels to fetch
Multiply (s,t) by (width,height) of texture base level
Could convert (s,t) to fixed-point first
Or do math in floating-point
Say base texture is 256x256
So compute (s*256, t*256)=(u,v)
26. Mipmap Level-of-detail Selection
Tri-linear mip-mapping means compute
appropriate mipmap level
Hardware rasterizes in 2x2 pixel entities
Typically called quad-pixels or just quad
Finite difference with neighbors (one-pixel separation) to get change in u and v with respect to window space
Approximation to ∂u/∂x, ∂u/∂y, ∂v/∂x, ∂v/∂y
Means 4 subtractions per quad (1 per pixel)
Now compute approximation to gradient length
p = max(sqrt((∂u/∂x)^2+(∂u/∂y)^2),
sqrt((∂v/∂x)^2+(∂v/∂y)^2))
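A minimal sketch of the finite-difference step, assuming the quad's u and v values are stored as 2x2 nested lists (a hypothetical layout) and that one set of four differences is shared by the whole quad. The grouping of terms follows the formula on the slide.

```python
import math

def gradient_length(quad_u, quad_v):
    """quad_u / quad_v: [[top-left, top-right], [bottom-left, bottom-right]]."""
    du_dx = quad_u[0][1] - quad_u[0][0]   # horizontal neighbor difference
    du_dy = quad_u[1][0] - quad_u[0][0]   # vertical neighbor difference
    dv_dx = quad_v[0][1] - quad_v[0][0]
    dv_dy = quad_v[1][0] - quad_v[0][0]
    # Four subtractions in total, then the larger of the two lengths.
    return max(math.sqrt(du_dx ** 2 + du_dy ** 2),
               math.sqrt(dv_dx ** 2 + dv_dy ** 2))
```

For example, a quad where u grows by 3 texels per pixel in x and 4 in y, while v grows by 1 in x only, gives p = 5.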
27. Level-of-detail Bias and Clamping
Convert p length to power-of-two level-of-detail and
apply LOD bias
λ = log2(p) + lodBias
Now clamp λ to valid LOD range
λ’ = max(minLOD, min(maxLOD, λ))
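The two formulas above can be sketched as one function. The default `max_lod` of 8 assumes the 256x256 base level used earlier (levels 0 through 8); the parameter names and the handling of a non-positive p are assumptions for illustration.

```python
import math

def clamped_lod(p, lod_bias=0.0, min_lod=0.0, max_lod=8.0):
    """Return lambda' = clamp(log2(p) + lod_bias, min_lod, max_lod)."""
    if p <= 0.0:
        # log2 is undefined here; treat it as minus infinity, which clamps to min_lod.
        return min_lod
    lam = math.log2(p) + lod_bias
    return max(min_lod, min(max_lod, lam))
```

A gradient length of 4 texels per pixel selects level 2; magnification (p below 1) clamps to the base level.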
28. Determine Mipmap Levels and
Level Filtering Weight
Determine lower and upper mipmap levels
b = floor(λ’) is bottom mipmap level
t = floor(λ’+1) is top mipmap level
Determine filter weight between levels
w = frac(λ’) is filter weight
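As a sketch of this step, with illustrative names and frac computed as the value minus its floor:

```python
import math

def mip_levels_and_weight(lam):
    """Split a clamped LOD into bottom level, top level and blend weight."""
    bottom = math.floor(lam)         # b = floor(lambda')
    top = math.floor(lam + 1)        # t = floor(lambda' + 1)
    weight = lam - math.floor(lam)   # w = frac(lambda')
    return bottom, top, weight
```

With λ’ = 2.25 this gives levels 2 and 3 with a weight of 0.25 toward level 3. When λ’ sits exactly on the last level, the top level computed this way does not exist; a real implementation would presumably clamp it, which this sketch does not do.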
29. Determine Texture Sample Point
Get (u,v) for selected top and bottom mipmap levels
Consider a level l which could be either level t or b
With (u,v) locations (u_l, v_l)
Here (u,v) are normalized to [0,1] within the level, so half a texel is 1/(2*widthOfLevel(l))
Perform GL_CLAMP_TO_EDGE wrap modes
uw = max(1/(2*widthOfLevel(l)),
min(1-1/(2*widthOfLevel(l)), u))
vw = max(1/(2*heightOfLevel(l)),
min(1-1/(2*heightOfLevel(l)), v))
Get integer location (i,j) within each level
(i,j) = ( floor(uw*widthOfLevel(l)),
floor(vw*heightOfLevel(l)) )
[Diagram: (s,t) texture square with edge and border labeled]
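A minimal sketch of the clamp-to-edge step, reading the clamp bounds as half a texel, 1/(2*widthOfLevel(l)), and treating the incoming coordinates as normalized to [0,1]. The function names are hypothetical.

```python
import math

def clamp_to_edge(coord, size):
    """Clamp a normalized coordinate to the centers of the edge texels."""
    half_texel = 1.0 / (2 * size)
    return max(half_texel, min(1.0 - half_texel, coord))

def integer_location(u, v, width, height):
    """Return the integer texel location (i, j) within a level."""
    uw = clamp_to_edge(u, width)
    vw = clamp_to_edge(v, height)
    return math.floor(uw * width), math.floor(vw * height)
```

For the earlier fetch at (0.6, 0.25) in a 256x256 level this yields (153, 64); coordinates outside [0,1] land on the first or last texel.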
30. Determine Texel Locations
Bilinear sample needs 4 texel locations
(i0,j0), (i0,j1), (i1,j0), (i1,j1)
With integer texel coordinates
i0 = floor(i-1/2)
i1 = floor(i+1/2)
j0 = floor(j-1/2)
j1 = floor(j+1/2)
Also compute fractional weights for bilinear filtering
a = frac(i-1/2)
b = frac(j-1/2)
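A sketch of the footprint computation. It assumes the i and j in the formulas above are continuous texel-space coordinates (before any flooring); with already-integer inputs the fractional weights would always be 1/2. The names are illustrative.

```python
import math

def bilinear_footprint(x, y):
    """x, y: continuous texel-space coordinates within one mipmap level."""
    i0 = math.floor(x - 0.5)
    i1 = math.floor(x + 0.5)
    j0 = math.floor(y - 0.5)
    j1 = math.floor(y + 0.5)
    a = (x - 0.5) - i0   # frac(x - 1/2): weight toward column i1
    b = (y - 0.5) - j0   # frac(y - 1/2): weight toward row j1
    return (i0, i1, j0, j1), (a, b)
```

The half-texel shift reflects that texel centers sit at half-integer positions, so a sample exactly on a texel center gets a weight of 0 and touches that texel only.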
31. Determine Texel Addresses
Assuming a texture level image’s base pointer, compute a texel
address of each texel to fetch
Assume bytesPerTexel = 4 bytes for RGBA8 texture
Example
addr00 = baseOfLevel(l) +
bytesPerTexel*(i0+j0*widthOfLevel(l))
addr01 = baseOfLevel(l) +
bytesPerTexel*(i0+j1*widthOfLevel(l))
addr10 = baseOfLevel(l) +
bytesPerTexel*(i1+j0*widthOfLevel(l))
addr11 = baseOfLevel(l) +
bytesPerTexel*(i1+j1*widthOfLevel(l))
More complicated address schemes are needed for good texture
locality!
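The linear addressing in the example can be sketched as below; the base address in the usage is an arbitrary illustrative value, and the tiled or swizzled layouts that real hardware would use for locality are not modeled.

```python
BYTES_PER_TEXEL = 4  # RGBA8

def texel_address(base, i, j, width):
    """Row-major byte address of texel (i, j) in a level that starts at base."""
    return base + BYTES_PER_TEXEL * (i + j * width)

def footprint_addresses(base, i0, i1, j0, j1, width):
    """Addresses of the four bilinear texels, named as in the example above."""
    return {
        "addr00": texel_address(base, i0, j0, width),
        "addr01": texel_address(base, i0, j1, width),
        "addr10": texel_address(base, i1, j0, width),
        "addr11": texel_address(base, i1, j1, width),
    }
```

In this layout horizontally adjacent texels are 4 bytes apart but vertically adjacent ones are a full row (1024 bytes for a 256-wide level) apart, which is the locality problem the last bullet points at.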
32. Initiate Texture Reads
Initiate texture memory reads at the 8 texel addresses
addr00, addr01, addr10, addr11 for the upper level
addr00, addr01, addr10, addr11 for the lower level
Queue the weights a, b, and w
Latency FIFO in hardware makes these weights available
when texture reads complete
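A toy model of the latency FIFO, assuming memory returns reads in the order they were issued; the class and method names are hypothetical and nothing here reflects the actual hardware design.

```python
from collections import deque

class LatencyFifo:
    """Holds filter weights while the matching texel reads are in flight."""

    def __init__(self):
        self._pending = deque()

    def issue(self, addresses, weights):
        """Selection phase: queue (a, b, w) and hand the addresses to memory."""
        self._pending.append(weights)
        return addresses

    def complete(self, texels):
        """Combination phase: pair returned texels with the oldest queued weights."""
        return texels, self._pending.popleft()
```

The depth such a queue would need is set by the memory latency to be hidden: every fetch issued before the first read returns must have its weights parked somewhere.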
33. Phased Data Flow
Must hide long memory read latency between Selection and Combination phases
[Diagram: texture coordinate vector → Texel Selection → texel offsets → memory reads for samples (texture images) → texel data → Texel Combination → filtered texel vector. Combination parameters are FIFOed from Selection to Combination; texture parameters feed both stages.]
34. Texel Combination
When texel reads are returned, begin filtering
Assume results are
Top texels: t00, t01, t10, t11
Bottom texels: b00, b01, b10, b11
Per-component filtering math is tri-linear filter
RGBA8 is four components
result = (1-a)*(1-b)*(1-w)*b00 +
(1-a)*b*(1-w)*b01 +
a*(1-b)*(1-w)*b10 +
a*b*(1-w)*b11 +
(1-a)*(1-b)*w*t00 +
(1-a)*b*w*t01 +
a*(1-b)*w*t10 +
a*b*w*t11;
24 MADs per component, or 96 for RGBA
Lerp-tree could do 14 MADs per component, or 56 for RGBA
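Both forms of the filter can be sketched for a single component as below. Texels are passed as dictionaries keyed "00", "01", "10", "11" (a hypothetical representation), with a weighting toward i1, b toward j1 and w toward the top level, as defined earlier.

```python
def lerp(x, y, t):
    """Linear interpolation from x to y; counted as 2 MADs on the slide."""
    return x + t * (y - x)

def trilinear_direct(a, b, w, bot, top):
    """Eight weighted terms with three multiplies each (24 per component)."""
    return ((1 - a) * (1 - b) * (1 - w) * bot["00"] +
            (1 - a) * b * (1 - w) * bot["01"] +
            a * (1 - b) * (1 - w) * bot["10"] +
            a * b * (1 - w) * bot["11"] +
            (1 - a) * (1 - b) * w * top["00"] +
            (1 - a) * b * w * top["01"] +
            a * (1 - b) * w * top["10"] +
            a * b * w * top["11"])

def bilinear(a, b, t):
    """Three lerps: along i on each row, then along j."""
    row0 = lerp(t["00"], t["10"], a)
    row1 = lerp(t["01"], t["11"], a)
    return lerp(row0, row1, b)

def trilinear_lerp_tree(a, b, w, bot, top):
    """Seven lerps in total: 3 per level plus 1 between levels."""
    return lerp(bilinear(a, b, bot), bilinear(a, b, top), w)
```

The two functions agree up to floating-point rounding; the tree reaches the same result with 7 lerps, which matches the 14 MADs per component quoted above.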
36. Observations about the Texture Fetch
Lots of ways to implement the math
Lots of clever ways to be efficient
Lots more texture operations not considered in this analysis
Compression
Anisotropic filtering
sRGB
Shadow mapping
Arguably TEX instructions are “world’s most CISC instructions”
Texture fetches are incredibly complex instructions
Good deal of GPU’s superiority at graphics operations over CPUs
is attributable to TEX instruction efficiency
Good for compute too
37. Take Away Information
The GPU texture fetch is about two orders of magnitude
more complex than the most complex CPU instruction
And texture fetches are extremely common
Dozens of billions of texture fetches are expected by modern
GPU applications
Not just a graphics thing
Using CUDA, you can access textures from within your
compute- and bandwidth-intensive parallel kernels