This document describes a primitive processing and advanced shading architecture for embedded systems. It features a vertex cache and programmable primitive engine that can process fixed and variable size primitives with reduced memory bandwidth requirements. The architecture includes a configurable per-fragment shader that supports various shading models using dot products and lookup tables stored on-chip. This hybrid design aims to bring appealing shading to embedded applications while meeting limitations on gate size, power consumption, and memory traffic growth.