This document summarizes several human action recognition datasets. It covers both single-label datasets, in which an entire video is assigned a single action class, and multi-label datasets, in which actions are temporally localized within untrimmed videos. Datasets are further categorized as generic, instructional, egocentric, compositional, multi-view, or multi-modal, depending on the type of activities and the data modalities they include. Several prominent multi-modal datasets are highlighted, such as PKU-MMD, NTU RGB+D, MMAct, and HOMAGE, which provide video alongside additional modalities such as depth, infrared, audio, and sensor data.
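To make the distinction between the two annotation styles concrete, the following is a minimal sketch (not drawn from any of the datasets above) of how a video-level label differs from temporally localized, possibly overlapping action segments; all class names, field names, and timestamps are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VideoLevelAnnotation:
    """Single-label style: one action class assigned to the whole video."""
    video_id: str
    label: str  # e.g. "jumping" (hypothetical class name)


@dataclass
class ActionSegment:
    """One temporally localized action instance inside an untrimmed video."""
    label: str
    start_sec: float
    end_sec: float


@dataclass
class TemporalAnnotation:
    """Multi-label style: several (possibly overlapping) segments per video."""
    video_id: str
    segments: List[ActionSegment] = field(default_factory=list)


# Hypothetical examples of each annotation style
clip = VideoLevelAnnotation(video_id="v_0001", label="jumping")

untrimmed = TemporalAnnotation(
    video_id="v_0002",
    segments=[
        ActionSegment("open_fridge", 3.2, 5.8),
        ActionSegment("pour_milk", 5.5, 9.1),  # overlaps the previous segment
    ],
)
```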