Visual storytelling

Ting-Hao Huang et al.
2017/10/3

Abstract
1. Introduce the first dataset for sequential vision-to-language
2. SIND v.1 (Sequential Images Narrative Dataset)
3. 81743 unique photos in 20211 sequences
4. Establish strong baselines
5. Move artificial intelligence from basic understandings of typical visual scenes
towards more and more human-like understanding of grounded event structure
and subjective expression

Table of contents
1. Introduction
2. Motivation and Related Work
3. Dataset Construction
4. Data Analysis
5. Automatic Evaluation Metric
6. Baseline Experiments
7. Conclusion

Introduction
1. From description (concrete, literal) to narrative (abstract, further inference)
2. “Sitting next to each other” vs. “Having a good time”
3. Release three tiers of language for the same images
1) Descriptions of images-in-isolation (DII)
2) Descriptions of images-in-sequence (DIS)
3) Stories for images-in-sequence (SIS)

Motivation and Related Work
1. Image Captioning (2014,2015)
2. Question answering (2014)
3. Visual phrase (2011)
4. Vision Understanding (2013)
5. Visual Concepts (2015,2016)
These works focus on direct, literal description of image content

Dataset Construction
1. Extracting Photos
1. Leverage the idea that “storyable” events tend to involve some form of
possession (John’s party; Shabnam’s visit)
2. Extract Flickr data with possessive dependency patterns (Stanford CoreNLP)
3. Use WordNet 3.0 to keep only the possessed nouns that are classified as an EVENT
4. Only include albums with 10 to 50 photos where all album photos are taken
within a 48-hour span
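
The album filter in step 4 can be sketched as follows. This is a minimal illustration, assuming each album is available as a list of photo timestamps; the function name `is_storyable_album` and the sample album are hypothetical and not taken from the paper.

```python
from datetime import datetime, timedelta


def is_storyable_album(timestamps, min_photos=10, max_photos=50,
                       max_span=timedelta(hours=48)):
    """Return True if the album has 10 to 50 photos, all taken within 48 hours."""
    if not (min_photos <= len(timestamps) <= max_photos):
        return False
    # The span is the gap between the earliest and the latest photo.
    return max(timestamps) - min(timestamps) <= max_span


# Hypothetical album: 12 photos taken one hour apart.
start = datetime(2015, 7, 4, 9, 0)
album = [start + timedelta(hours=i) for i in range(12)]
print(is_storyable_album(album))  # True
```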

2. Crowdsourcing Stories In Sequence
1. 2-stage crowdsourcing
2. Storytelling: the worker selects a subset of photos and writes a story about it
3. Re-telling: the worker writes a story based on one photo sequence generated
in the first stage

3. Crowdsourcing Descriptions of Images In Isolation & Images In Sequence
1. Also collect descriptions of images-in-isolation (DII) and descriptions of
images-in-sequence (DIS)
2. Follow the instructions for image captioning (MS COCO), e.g., describe all the
important parts

4. Data Post-processing
1. Replace people's names and identified named entities with generic tokens

Data Analysis
1. Dataset includes 10117 Flickr albums with 210819 unique photos.
2. Use normalized pointwise mutual information to identify the words most closely
associated with each tier.
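
As a rough illustration of item 2, normalized pointwise mutual information between a word w and a tier t can be computed as npmi(w, t) = log(p(w, t) / (p(w) p(t))) / (-log p(w, t)). The sketch below assumes simple token counts per tier; the function name `npmi_by_tier` and the toy tokens are hypothetical, and the paper's exact counting procedure may differ.

```python
import math
from collections import Counter


def npmi_by_tier(tier_tokens):
    """Map (word, tier) -> NPMI, given a dict of tier name -> list of tokens."""
    joint = Counter()
    for tier, tokens in tier_tokens.items():
        for tok in tokens:
            joint[(tok, tier)] += 1
    total = sum(joint.values())

    word_counts, tier_counts = Counter(), Counter()
    for (word, tier), count in joint.items():
        word_counts[word] += count
        tier_counts[tier] += count

    scores = {}
    for (word, tier), count in joint.items():
        p_wt = count / total
        if p_wt == 1.0:
            continue  # degenerate case: the normalizer -log(1) is zero
        p_w = word_counts[word] / total
        p_t = tier_counts[tier] / total
        pmi = math.log(p_wt / (p_w * p_t))
        scores[(word, tier)] = pmi / -math.log(p_wt)
    return scores


# Toy tokens, invented for illustration only.
toy = {
    "DII": ["black", "dog", "sitting", "black"],
    "SIS": ["we", "had", "a", "great", "time", "dog"],
}
scores = npmi_by_tier(toy)
# Words with the highest NPMI for a tier are the ones most associated with it.
top_dii = max((w for (w, t) in scores if t == "DII"),
              key=lambda w: scores[(w, "DII")])
print(top_dii)
```

Scores range from -1 to 1; a word shared across tiers (here "dog") scores lower for a tier than a word that occurs only in that tier.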

Automatic Evaluation Metric
1. Human judgment is the most reliable way to evaluate.
2. Compute pairwise correlation coefficients between automatic metrics and
human judgments (score from 1-5) on 3000 stories from SIS training set.
3. Automatic metrics: METEOR, smoothed BLEU and Skip-Thoughts
4. METEOR correlates best with human judgment.
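
The comparison in item 2 can be sketched as below. The slides do not say which correlation coefficients were used, so Pearson and Spearman are shown here as an assumption, implemented in plain Python. The scores are invented for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical human ratings (1-5) and two hypothetical metric scores per story.
human = [1, 2, 3, 4, 5]
metric_a = [0.10, 0.15, 0.22, 0.30, 0.41]
metric_b = [0.40, 0.10, 0.30, 0.20, 0.25]
print(spearman(human, metric_a), spearman(human, metric_b))
```

The metric whose scores track the human ratings most closely receives the highest coefficient.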

Baseline Experiments
1. Use a sequence-to-sequence recurrent neural network
2. Encode an image sequence by running an RNN over fc7 vectors of each image, in
reverse order.
3. Use Gated Recurrent Units (GRUs) for both the image encoder and story decoder
4. Initially, use beam search (size = 10), but it produces many repetitive sentences.
5. Greedy search (beam size = 1) significantly increases the story quality.
6. Add a constraint: the same content word cannot be produced more than once within a given story.
7. Allow “visually grounded” words only when they are licensed by the caption model
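
The greedy search with the no-repeated-content-word constraint can be sketched as below. This is a toy illustration under stated assumptions: `next_word_scores` stands in for the decoder, and the stopword set, the `banned` set and the score tables are all hypothetical. It is not the authors' implementation.

```python
def greedy_decode(next_word_scores, stopwords, banned=frozenset(),
                  end_token="</s>", max_len=20):
    """Pick the best-scoring word at each step, never repeating a content word."""
    story = []
    used_content = set()
    for _ in range(max_len):
        scores = next_word_scores(story)
        # Drop banned words and content words that were already produced.
        candidates = {w: s for w, s in scores.items()
                      if w not in banned and w not in used_content}
        if not candidates:
            break
        word = max(candidates, key=candidates.get)
        if word == end_token:
            break
        story.append(word)
        if word not in stopwords:
            used_content.add(word)  # function words may still repeat
    return story


# Hypothetical per-step scores standing in for a trained decoder.
steps = [
    {"the": 0.6, "beach": 0.3, "</s>": 0.1},
    {"beach": 0.7, "UNK": 0.9, "was": 0.1},
    {"beach": 0.5, "was": 0.4, "</s>": 0.1},
    {"beach": 0.5, "fun": 0.4, "</s>": 0.1},
    {"</s>": 0.9, "the": 0.1},
]


def toy_scores(prefix):
    return steps[min(len(prefix), len(steps) - 1)]


story = greedy_decode(toy_scores, stopwords={"the", "was"}, banned={"UNK"})
print(" ".join(story))  # the beach was fun
```

Without the constraint, this toy decoder would emit "beach" three times in a row, which mirrors the repetition problem noted above.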

The details of the training were:
1. Extract 4096-dim fc7 features using VGG16 without fine-tuning.
2. The encoder reads over the 5 images in a sequence. The order of the images is reversed (i.e., the
first image in the sequence is the last one read in), following what is commonly done for
machine translation. This is probably not important, though.
3. The encoder and decoder are 1000-dimensional GRUs (no weight sharing).
4. The target word embedding size is 250 dimensions (i.e., the size of the embedding when the word
that was just produced is fed into the decoder GRU).
5. The target vocabulary consists of the words that occur 3 or more times in the training data. Other
words are mapped to UNK (there is a constraint in the decoder, however, that UNK cannot be
produced at test time).
6. 0.5 dropout on the image fc7 input (i.e., 50% of the 4096-dim fc7 features are dropped out
before being fed into the encoder GRU. This is probably not important).
7. 0.5 dropout on the decoder GRU layer before applying it to the output layer.
8. If the story model is co-trained with caption data, use a token in the encoder GRU to
indicate which type of output to produce.
9. The setup is analogous to machine translation sequence-to-sequence models.
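
Two of the preprocessing steps above (the frequency-thresholded vocabulary in item 5 and the reversed image order in item 2) can be sketched in plain Python. The helper names and the toy stories are hypothetical; the feature vectors are replaced by placeholder strings.

```python
from collections import Counter

UNK = "UNK"


def build_vocab(stories, min_count=3):
    """Keep the words that occur at least min_count times in the training stories."""
    counts = Counter(tok for story in stories for tok in story)
    return {tok for tok, count in counts.items() if count >= min_count}


def map_to_vocab(tokens, vocab):
    """Replace out-of-vocabulary words with the UNK token."""
    return [tok if tok in vocab else UNK for tok in tokens]


def encoder_input_order(image_features):
    """Reverse the sequence so the first image is the last one the encoder reads."""
    return list(reversed(image_features))


# Toy training stories, invented for illustration.
train = [
    ["we", "went", "to", "the", "beach"],
    ["we", "love", "the", "beach"],
    ["we", "saw", "the", "beach"],
]
vocab = build_vocab(train)
print(map_to_vocab(["we", "went", "to", "the", "park"], vocab))
print(encoder_input_order(["img1", "img2", "img3", "img4", "img5"]))
```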
