Visual Storytelling (NAACL 2016, Poster)

We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.

Published in: Science

*Ting-Hao (Kenneth) Huang [1], *Francis Ferraro [2], Nasrin Mostafazadeh [3], Ishan Misra [1], Jacob Devlin [6], Aishwarya Agrawal [4], Ross Girshick [5], Xiaodong He [6], Pushmeet Kohli [6], Dhruv Batra [4], Larry Zitnick [5], Devi Parikh [5], Lucy Vanderwende [6], Michel Galley [6] and Margaret Mitchell [6]

[1] Carnegie Mellon University, [2] Johns Hopkins University, [3] University of Rochester, [4] Virginia Tech, [5] Facebook AI Research, [6] Microsoft Research

Motivation

Stories ≠ Consecutive Captions ≠ Descriptive Text

A solid next move in Artificial Intelligence is to go beyond basic description of visual scenes towards human-like understanding of grounded event structure and subjective expression. We introduce the first dataset for sequential vision-to-language and explore how modeling concrete description as well as figurative and social language enables visual storytelling. Our data is at sind.ai.

Example for one photo sequence:

DII: A black frisbee is sitting on top of a roof. A man playing soccer outside of a white house with a red door. The boy is throwing a soccer ball by the red door. A soccer ball is over a roof by a frisbee in a rain gutter. Two balls and a frisbee are on top of a roof.

SIS: A discus got stuck up on the roof. Why not try getting it down with a soccer ball? Up the soccer ball goes. It didn't work so we tried a volley ball. Now the discus, soccer ball, and volleyball are all stuck on the roof.

Getting Humans to Tell Stories

[Figure. Labels: Flickr Album; Storytelling; Re-telling; Preferred Photo Sequence; Story 1 to Story 5; Description for Images in Isolation & in Sequences]

Data Analysis

                                  Text/Image         Vocab      Words/   Web Ppl.
                                  Pairs (K)          Size (K)   Sent.    (30B words)
Brown (comparison only)           52.1 (text only)   47.7       20.8     194.0
DII (Description-in-isolation)    151.8              13.8       11.0     147.0
SIS (Stories-in-sequence)         252.9              18.2       10.2     116.0

See our paper for the description-in-sequence tiers (DIS) and more! We define 80-5-5-10 train-dev-validation-test splits for all three tiers.

Automatic Evaluation and Results

                Pearson's r
BLEU            0.08
SkipThoughts    0.18
METEOR          0.22

Correlations of automatic scores against human judgments on 3K random SIS training stories. All values are statistically significant (p < 1e-5).

        Beam = 10   Beam = 1   -Dups   +Grounded
DII     23.55       19.10      19.21   ----
SIS     23.13       27.76      30.11   31.42

METEOR scores on the validation split, using a sequence-to-sequence NN with gated recurrent units.

Get Better Stories with Uniqueness & Visually Grounded Constraints

Caption Output: This is a picture of a family. This is a picture of a cake. This is a picture of a dog. This is a picture of a beach. This is a picture of a beach.

Greedy Stories: The family gathered together for a meal. The food was delicious. The dog was excited to be there. The dog was enjoying the water. The dog was happy to be in the water.

-Dups: The family gathered together for a meal. The food was delicious. The dog was excited to be there. The kids were playing in the water. The boat was a little too much to drink.

+Grounded: The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.

Conclusion

Several strong baselines for the task of visual storytelling demonstrate that intelligent machines can now begin to generate inferential, conceptual, and evaluative language to share humanlike experience. METEOR serves as an automatic metric for evaluation, best correlated with human descriptions. Much more work to be done: Combining a fully grounded model with a model free to dream yields the best automatically generated stories to date.
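
The poster compares automatic metrics by their Pearson correlation with human judgments. As a rough illustration of that kind of comparison, the following minimal sketch computes Pearson's r between per-story metric scores and human ratings. The function name `pearson_r` and all score values are hypothetical and invented for illustration; they are not taken from the dataset, and this is not the authors' evaluation code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-story values (assumption: one metric score and one
# human rating per story); these numbers are made up for the example.
metric_scores = [0.21, 0.35, 0.28, 0.40, 0.18]
human_ratings = [2.0, 4.0, 3.0, 4.5, 2.5]

r = pearson_r(metric_scores, human_ratings)
print(f"Pearson's r = {r:.2f}")
```

Under this reading, a metric whose scores rise and fall with the human ratings gets an r closer to 1, which is the sense in which the poster ranks METEOR above SkipThoughts and BLEU.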
