This document discusses visual storytelling, the task of moving beyond literal, per-image description to generating coherent narratives for sequences of images. It introduces SIND, the first dataset for sequential vision-to-language, built from Flickr album photos annotated at three tiers: descriptions of images in isolation, descriptions of images in sequence, and stories for images in sequence. Baseline experiments train sequence-to-sequence models to generate stories from image sequences, and find that decoding constraints, such as blocking repeated content words and restricting output to visually grounded words, improve over plain beam search as measured by automatic metrics (chiefly METEOR) that the authors find correlate with human judgment. The work aims to move AI toward a more human-like understanding of grounded event structure and the subjective expression found in narratives.
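
The decoding heuristics lend themselves to a small illustration. Below is a minimal Python sketch of beam search with a "no repeated content words" constraint in the spirit of the heuristic summarized above. The bigram table LM, the STOPWORDS set, and all probabilities are invented stand-ins: the paper's models are sequence-to-sequence recurrent networks, not bigram tables, and its visual-grounding constraint (restricting the vocabulary to grounded words) would slot in as one more pruning test alongside the duplicate check.

import math

# Hypothetical toy conditional distribution P(next_word | prev_word).
# A real system would query the sequence-to-sequence decoder here;
# all words and probabilities below are invented for illustration.
LM = {
    "<s>":   {"the": 0.6, "we": 0.4},
    "the":   {"party": 0.5, "beach": 0.3, "the": 0.2},
    "we":    {"had": 0.7, "went": 0.3},
    "went":  {"to": 1.0},
    "to":    {"the": 1.0},
    "party": {"was": 0.6, "</s>": 0.4},
    "beach": {"was": 0.5, "</s>": 0.5},
    "had":   {"fun": 0.6, "the": 0.4},
    "was":   {"fun": 0.7, "great": 0.3},
    "fun":   {"fun": 0.3, "</s>": 0.7},  # a "fun fun" path exists so the constraint can fire
    "great": {"</s>": 1.0},
}

# Function words are exempt from the duplicate check, mirroring the idea
# of constraining only *content* words (this stopword list is made up).
STOPWORDS = {"<s>", "</s>", "the", "to", "we", "was", "had"}

def repeats_content_word(hyp, word):
    """True if appending `word` would repeat a content (non-stop) word."""
    return word not in STOPWORDS and word in hyp

def beam_search(beam_size=3, max_len=10):
    beams = [(0.0, ["<s>"])]          # (log-probability, partial hypothesis)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, hyp in beams:
            for word, p in LM.get(hyp[-1], {}).items():
                if repeats_content_word(hyp, word):
                    continue          # prune: duplicate content word
                candidates.append((logp + math.log(p), hyp + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, hyp in candidates:
            if hyp[-1] == "</s>":
                finished.append((logp, hyp))   # complete hypothesis
            elif len(beams) < beam_size:
                beams.append((logp, hyp))      # keep top-k partials
        if not beams:
            break
    logp, hyp = max(finished, key=lambda c: c[0])
    return " ".join(hyp[1:-1]), logp

if __name__ == "__main__":
    story, logp = beam_search()
    print(f"{story}  (log-prob {logp:.3f})")   # -> "the party  (log-prob -2.120)"

The pruning test is deliberately a single predicate, so a grounding constraint of the same shape (reject any word outside a set of visually grounded terms) could be added with one more `continue` branch.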