This paper proposes a novel model that learns to directly map a sequence of video frames to a sequence of words to generate descriptions of videos. The model uses LSTMs to handle variable frame numbers, learn temporal structure, and generate natural language descriptions. It represents frames using CNN features from RGB images and optical flow, achieves state-of-the-art results on MSVD, and outperforms prior work that generated descriptions from pooled frame representations or fixed templates.
Paper introduction: Sequence to Sequence - Video to Text (ICCV2015)
1. Sequence to Sequence - Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue,
Raymond Mooney, Trevor Darrell, Kate Saenko
ICCV 2015
M2 Soichiro Murakami
10/14/16
4. Main contribution
• To propose a novel model that learns to directly map a sequence of video frames to a sequence of words
A general seq2seq model (a minimal code sketch follows below) that can:
a. handle a variable number of frames,
b. learn and use the temporal structure of the video, and
c. learn a language model to generate natural and grammatical sentences.
Fig.1
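The sketch below is a hedged illustration of this general encode-then-decode idea, not the authors' code: one LSTM reads a variable-length sequence of frame features, and a second LSTM generates the word sequence from the resulting state. The class name, dimensions, and PyTorch formulation are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of the general seq2seq idea:
# an LSTM encodes a variable number of frame features, a second LSTM decodes words.
import torch.nn as nn

class EncoderDecoderSketch(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=500, vocab_size=10000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, word_ids):
        # frame_feats: (batch, n_frames, feat_dim) -- n_frames may vary per video
        # word_ids:    (batch, n_words) previous-word ids (teacher forcing)
        _, state = self.encoder(frame_feats)             # read the whole frame sequence
        dec_out, _ = self.decoder(self.embed(word_ids), state)
        return self.out(dec_out)                         # per-step scores over the vocabulary
```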
5. Related work 1/2
• Image captioning [8, 40]
1. generate a fixed-length vector representation of an image
2. decode this vector into a sequence of words
• FGM [36]
1. identify the semantic content (subject, verb, object, scene).
2. combine them with confidences from a language model using a
factor graph to infer the most likely tuple in the video.
3. generate a sentence based on a template.
• Mean Pool [39]
• LSTMs are used to generate video descriptions by pooling the
representations of individual frames.
6. Related work 2/2
• Temporal-Attention [43] (ICCV2015)
• employ a 3-D convnet model that incorporates spatiotemporal motion
features computed over dense trajectories (HoG, HoF, MBH).
• use an attention mechanism that learns to weight the frame
features.
7. Approach 1/2
• 3.1 LSTM for sequence modeling
• 3.2 Sequence to sequence video to text
p(y_1, ..., y_m | x_1, ..., x_n), where x_1, ..., x_n is the sequence of video frames
and y_1, ..., y_m is the sequence of words
Fig. 2 (the first LSTM layer's hidden state is concatenated with the embedded word and fed to the second layer)
z_t: output of the second LSTM layer (a softmax over z_t gives the word distribution)
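Below is a hedged sketch of this two-layer stack (assumed shapes and names, not the authors' Caffe implementation): the first LSTM consumes frame features, the second LSTM consumes the first layer's hidden state concatenated with the embedded previous word (zeros where the other modality is absent), and the decoding-stage outputs z_t feed a softmax for p(y_t | z_t).

```python
# Sketch of the S2VT stack (assumed shapes; teacher forcing with <BOS>-shifted words).
import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    def __init__(self, feat_dim=500, hidden_dim=1000, embed_dim=500, vocab_size=10000):
        super().__init__()
        self.lstm1 = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.lstm2 = nn.LSTM(hidden_dim + embed_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.classify = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, prev_words):
        # feats: (B, n, feat_dim) frame features; prev_words: (B, m) <BOS>-shifted word ids
        B, n, _ = feats.shape
        m = prev_words.shape[1]
        # Layer 1 sees frames while encoding and zero padding while decoding.
        x1 = torch.cat([feats, feats.new_zeros(B, m, feats.size(2))], dim=1)
        h1, _ = self.lstm1(x1)                                      # (B, n+m, hidden_dim)
        # Layer 2 sees h1 concatenated with zeros (encoding) or the embedded word (decoding).
        w = torch.cat([feats.new_zeros(B, n, self.embed.embedding_dim),
                       self.embed(prev_words)], dim=1)
        z, _ = self.lstm2(torch.cat([h1, w], dim=2))                # (B, n+m, hidden_dim)
        return self.classify(z[:, n:, :])                           # scores for p(y_t | z_t)
```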
8. Approach 2/2
• 3.3 Video and text representation
• RGB frames
• apply a pre-trained CNN (AlexNet or the 16-layer VGG model) to the input
images and feed the output of its top layer to the LSTM units
(feature extraction is sketched below).
• Optical Flow
• first extract classical variational optical flow features [2].
• then create flow images and apply a pre-trained CNN to them.
• Text
• embed words into a lower, 500-dimensional space by applying a linear
transformation to the input data.
• the word probabilities from the RGB and flow networks are combined with a
weighted average for the combined model.
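A hedged sketch of these input representations follows: fc7 activations of a pre-trained CNN for RGB frames, and "flow images" (x/y components centred at 128 plus a magnitude channel) fed to a second pre-trained CNN. The torchvision calls and the scale factor are assumptions; the paper used Caffe models.

```python
# Illustrative feature extraction (torchvision API and scale factor are assumptions;
# the paper used Caffe AlexNet / 16-layer VGG models).
import numpy as np
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
fc7 = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])  # drop the final classifier layer

def rgb_features(frames):
    # frames: (B, 3, 224, 224) preprocessed RGB frames -> (B, 4096) fc7 activations
    with torch.no_grad():
        x = torch.flatten(vgg.avgpool(vgg.features(frames)), 1)
        return fc7(x)

def flow_image(flow_xy, scale=8.0):
    # flow_xy: (H, W, 2) optical flow -> (H, W, 3) uint8 "flow image":
    # x/y components centred at 128 plus a flow-magnitude channel.
    mag = np.linalg.norm(flow_xy, axis=2)
    img = np.stack([flow_xy[..., 0] * scale + 128,
                    flow_xy[..., 1] * scale + 128,
                    mag * scale], axis=2)
    return np.clip(img, 0, 255).astype(np.uint8)
```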
9. Experimental Setup (1/3)
• Video description datasets
• Microsoft Video Description Corpus (MSVD)
• a collection of YouTube clips & single sentence descriptions from annotators.
• MPII Movie Description Dataset (MPII-MD)
• Hollywood movies & movie scripts and audio description data.
• Montreal Video Annotation Dataset (M-VAD)
• Hollywood movies & audio description data for the visually impaired.
• They used a single sentence as a target sentence for each video.
10. Experimental Setup (2/3)
Table 1. Corpus Statistics
Example of MPII-MD (from "A Dataset for Movie Description", Anna Rohrbach,
Marcus Rohrbach, Niket Tandon, Bernt Schiele, CVPR 2015)
11. Experimental Setup (3/3)
• Evaluation Metrics
• METEOR [7]
• METEOR compares exact token matches, stemmed tokens, paraphrase
matches, as well as semantically similar matches using WordNet synonyms.
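A hedged illustration of METEOR scoring follows, using NLTK's implementation rather than the METEOR tool used in the paper; recent NLTK versions expect pre-tokenized input and need the WordNet data downloaded. The example sentences are made up.

```python
# Illustration only: NLTK's METEOR, not the evaluation code used in the paper.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)               # synonym matching uses WordNet
reference = "a man is playing a guitar".split()
hypothesis = "a man plays the guitar".split()
print(meteor_score([reference], hypothesis))       # partial credit via stems/synonyms
```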
• Experimental details of the models
• unroll the LSTM to a fixed 80 time steps during training.
• for longer videos, truncate the number of frames (sketched below).
• for shorter videos, pad the remaining inputs with zeros.
• mini-batch size: up to 8 for AlexNet, up to 3 for the flow model.
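The small sketch below shows one way to fit each (video, caption) pair into the fixed 80 steps; how frames and words share the 80 steps, and the helper name, are assumptions.

```python
# Fit frame features into the fixed unrolling length (split between frames and words
# is an assumption): truncate long videos, zero-pad short ones.
import numpy as np

MAX_STEPS = 80

def fit_to_steps(frame_feats, num_words):
    # frame_feats: (n_frames, feat_dim); reserve num_words steps for the caption.
    max_frames = MAX_STEPS - num_words
    feats = frame_feats[:max_frames]                     # truncate long videos
    if len(feats) < max_frames:                          # zero-pad short videos
        pad = np.zeros((max_frames - len(feats), frame_feats.shape[1]))
        feats = np.concatenate([feats, pad], axis=0)
    return feats
```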
12. Results and Discussion ‒ MSVD dataset ‒
• The S2VT AlexNet model on RGB video frames achieves 27.9% METEOR.
• The flow-only model performs comparatively poorly.
• Polysemous words, e.g. "playing":
• playing a guitar
• playing golf
13. Results and Discussion ‒ Movie description datasets ‒
• It was best to use dropout at the inputs and outputs of both LSTM
layers (placement sketched below).
• SMT [28]
• translate holistic video
representations to a single sentence.
• Visual-Labels [27]
• an LSTM-based approach that uses no temporal encoding but more diverse
visual features, namely object detectors as well as activity and scene
classifiers.
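A hedged sketch of that dropout placement follows: dropout applied to the inputs and outputs of both LSTM layers of the stack. The 0.5 rate and the step-wise LSTMCell formulation are assumptions.

```python
# One time step of the two-layer stack with dropout around each layer
# (0.5 rate and the per-step formulation are assumptions, not the authors' code).
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)

def stacked_step(x_t, word_emb_t, lstm1_cell, lstm2_cell, state1, state2):
    h1, c1 = lstm1_cell(drop(x_t), state1)            # dropout on layer-1 input
    inp2 = torch.cat([drop(h1), word_emb_t], dim=1)   # dropout on layer-1 output
    h2, c2 = lstm2_cell(drop(inp2), state2)           # dropout on layer-2 input
    return drop(h2), (h1, c1), (h2, c2)               # dropped z_t feeds the softmax
```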
16. Conclusion
• They construct descriptions using a sequence to sequence
model, where frames are first read sequentially and then
words are generated sequentially.
• Their model achieves state-of-the-art performance on the
MSVD dataset.
• For further information...
• https://www.cs.utexas.edu/~vsub/s2vt.html