I will discuss the state of the art in video understanding, particularly its research and applications at Facebook. I will focus on two active areas: multimodality and time. Video is naturally multimodal, offering rich signals for content understanding while also opening new doors, such as unsupervised and weakly-supervised learning at scale. Temporal representation remains a largely open problem: while we can describe a few seconds of video, there is no natural representation for a few minutes of video. I will discuss recent progress, the importance of these problems for applications, and what we hope to achieve.