This document summarizes several datasets for image captioning, video classification, action recognition, and temporal localization. It describes the purpose, collection process, annotation format, examples and references for datasets including MS COCO, Visual Genome, Flickr8K/30K, Kinetics, Charades, AVA, STAIR Captions and Actions. The datasets vary in scale from thousands to millions of images/videos and cover a wide range of tasks from image captioning to complex activity recognition.
This document summarizes several datasets for image captioning, video classification, action recognition, and temporal localization. It describes the purpose, collection process, annotation format, examples and references for datasets including MS COCO, Visual Genome, Flickr8K/30K, Kinetics, Charades, AVA, STAIR Captions and Actions. The datasets vary in scale from thousands to millions of images/videos and cover a wide range of tasks from image captioning to complex activity recognition.