1) The document presents object-region video transformers (ORViT) for video recognition. ORViT applies attention at both the patch level and the object level.
2) ORViT models three aspects of objects: the objects themselves, the interactions between them, and their dynamics over time.
3) Experimental results show that ORViT outperforms baseline models on action recognition, compositional action recognition, and spatio-temporal action detection. It captures object-level information and dynamics better than patch-level attention alone.
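
To make the idea of attending at both the patch and object levels more concrete, here is a minimal sketch, not the authors' implementation. It assumes a single frame, identity query/key/value projections (no learned weights), and object tokens formed by averaging the patch vectors inside each bounding box. The function names `pool_object_token` and `patch_object_attention`, the box format, and the toy data are all hypothetical choices made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def pool_object_token(grid, box):
    """Average the patch vectors inside one box (assumed pooling scheme).

    grid: H x W nested list of patch feature vectors.
    box:  (row0, col0, row1, col1) in patch coordinates, end-exclusive.
    """
    r0, c0, r1, c1 = box
    cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(v[d] for v in cells) / len(cells) for d in range(dim)]


def patch_object_attention(grid, boxes):
    """Patch tokens act as queries over patch tokens plus object tokens.

    Simplification: queries, keys, and values are the raw features,
    so there are no learned projection matrices in this sketch.
    """
    patches = [vec for row in grid for vec in row]          # patch-level tokens
    objects = [pool_object_token(grid, b) for b in boxes]   # object-level tokens
    keys = patches + objects                                # joint key/value set
    dim = len(patches[0])
    scale = math.sqrt(dim)

    outputs, weights = [], []
    for q in patches:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wi * k[d] for wi, k in zip(w, keys)) for d in range(dim)]
        )
    return outputs, weights


# Toy 3x3 grid of 2-dimensional patch features and two made-up boxes.
grid = [[[float(r + c), float(r - c)] for c in range(3)] for r in range(3)]
boxes = [(0, 0, 2, 2), (1, 1, 3, 3)]
outputs, weights = patch_object_attention(grid, boxes)
print(len(outputs), len(weights[0]))
```

Each patch query receives one attention weight per patch and one per object, so the object tokens can influence every patch representation. A video model would also need the temporal dimension and learned projections, both of which this sketch leaves out.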