This document reviews evaluation datasets for Twitter sentiment analysis, introducing a new dataset called STS-Gold. It discusses various supervised and unsupervised sentiment analysis approaches, highlighting the limitations of existing datasets focusing primarily on tweet-level evaluations. The findings suggest that while there is a correlation between vocabulary size and tweet count, dataset sparsity negatively impacts classification performance.