The document discusses recent developments in video transformers. It summarizes several recent works that employ spatial backbones like ViT or ResNet combined with temporal transformers for video classification. Examples mentioned include VTN, TimeSformer, STAM, and ViViT. The document also discusses common practices in video transformer inference, like using multiple clips/crops and averaging predictions. Design choices covered include number of frames, spatial dimensions, and multi-view inference techniques.