This paper proposes a novel model that learns to directly map a sequence of video frames to a sequence of words to generate descriptions of videos. The model uses LSTMs to handle variable frame numbers, learn temporal structure, and generate natural language descriptions. It represents frames using CNN features from RGB images and optical flow, achieves state-of-the-art results on MSVD, and outperforms prior work that generated descriptions from pooled frame representations or fixed templates.
Paper introduction: Sequence to Sequence - Video to Text (ICCV2015)
1. Sequence to Sequence - Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue,
Raymond Mooney, Trevor Darrell, Kate Saenko
ICCV 2015
M2 Soichiro Murakami
10/14/16
4. Main contribution
• To propose a novel model that learns to directly map a sequence of video frames to a sequence of words
A general seq2seq model (a minimal code sketch follows below) that can:
a. handle a variable number of frames,
b. learn and use the temporal structure of the video, and
c. learn a language model to generate natural and grammatical sentences.
Fig.1
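The sketch below is a hedged illustration of this general encode-then-decode idea, not the authors' code: one LSTM reads a variable-length sequence of frame features, and a second LSTM generates the word sequence from the resulting state. The class name, dimensions, and PyTorch formulation are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of the general seq2seq idea:
# an LSTM encodes a variable number of frame features, a second LSTM decodes words.
import torch.nn as nn

class EncoderDecoderSketch(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=500, vocab_size=10000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, word_ids):
        # frame_feats: (batch, n_frames, feat_dim) -- n_frames may vary per video
        # word_ids:    (batch, n_words) previous-word ids (teacher forcing)
        _, state = self.encoder(frame_feats)             # read the whole frame sequence
        dec_out, _ = self.decoder(self.embed(word_ids), state)
        return self.out(dec_out)                         # per-step scores over the vocabulary
```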
5. Related work 1/2
• Image captioning [8, 40]
1. generate a fixed-length vector representation of an image
2. decode this vector into a sequence of words
• FGM [36]
1. identify the semantic content (subject, verb, object, scene).
2. combine them with confidences from a language model using a
factor graph to infer the most likely tuple in the video.
3. generate a sentence based on a template.
• Mean Pool [39]
• LSTMs are used to generate video descriptions by pooling the
representations of individual frames.
6. Related work 2/2
• Temporal-Attention [43] (ICCV2015)
• employ a 3-D convnet model that incorporates spatiotemporal motion
features computed over dense trajectories (HoG, HoF, MBH).
• use an attention mechanism that learns to weight the frame
features.
7. Approach 1/2
• 3.1 LSTM for sequence modeling
• 3.2 Sequence to sequence video to text
p(y_1, ..., y_m | x_1, ..., x_n), where x_1, ..., x_n is the sequence of video frames
and y_1, ..., y_m is the sequence of words
Fig. 2 (the first LSTM layer's hidden state is concatenated with the embedded word and fed to the second layer)
z_t: output of the second LSTM layer (a softmax over z_t gives the word distribution)
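Below is a hedged sketch of this two-layer stack (assumed shapes and names, not the authors' Caffe implementation): the first LSTM consumes frame features, the second LSTM consumes the first layer's hidden state concatenated with the embedded previous word (zeros where the other modality is absent), and the decoding-stage outputs z_t feed a softmax for p(y_t | z_t).

```python
# Sketch of the S2VT stack (assumed shapes; teacher forcing with <BOS>-shifted words).
import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    def __init__(self, feat_dim=500, hidden_dim=1000, embed_dim=500, vocab_size=10000):
        super().__init__()
        self.lstm1 = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.lstm2 = nn.LSTM(hidden_dim + embed_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.classify = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, prev_words):
        # feats: (B, n, feat_dim) frame features; prev_words: (B, m) <BOS>-shifted word ids
        B, n, _ = feats.shape
        m = prev_words.shape[1]
        # Layer 1 sees frames while encoding and zero padding while decoding.
        x1 = torch.cat([feats, feats.new_zeros(B, m, feats.size(2))], dim=1)
        h1, _ = self.lstm1(x1)                                      # (B, n+m, hidden_dim)
        # Layer 2 sees h1 concatenated with zeros (encoding) or the embedded word (decoding).
        w = torch.cat([feats.new_zeros(B, n, self.embed.embedding_dim),
                       self.embed(prev_words)], dim=1)
        z, _ = self.lstm2(torch.cat([h1, w], dim=2))                # (B, n+m, hidden_dim)
        return self.classify(z[:, n:, :])                           # scores for p(y_t | z_t)
```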
8. Approach 2/2
• 3.3 Video and text representation
• RGB frames
• apply a pre-trained CNN (AlexNet or the 16-layer VGG model) to the input
images and feed the output of its top layer to the LSTM units
(feature extraction is sketched below).
• Optical Flow
• first extract classical variational optical flow features [2].
• then create flow images and apply a pre-trained CNN to them.
• Text
• embed words into a lower, 500-dimensional space by applying a linear
transformation to the input data.
• the word probabilities from the RGB and flow networks are combined with a
weighted average for the combined model.
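A hedged sketch of these input representations follows: fc7 activations of a pre-trained CNN for RGB frames, and "flow images" (x/y components centred at 128 plus a magnitude channel) fed to a second pre-trained CNN. The torchvision calls and the scale factor are assumptions; the paper used Caffe models.

```python
# Illustrative feature extraction (torchvision API and scale factor are assumptions;
# the paper used Caffe AlexNet / 16-layer VGG models).
import numpy as np
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
fc7 = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])  # drop the final classifier layer

def rgb_features(frames):
    # frames: (B, 3, 224, 224) preprocessed RGB frames -> (B, 4096) fc7 activations
    with torch.no_grad():
        x = torch.flatten(vgg.avgpool(vgg.features(frames)), 1)
        return fc7(x)

def flow_image(flow_xy, scale=8.0):
    # flow_xy: (H, W, 2) optical flow -> (H, W, 3) uint8 "flow image":
    # x/y components centred at 128 plus a flow-magnitude channel.
    mag = np.linalg.norm(flow_xy, axis=2)
    img = np.stack([flow_xy[..., 0] * scale + 128,
                    flow_xy[..., 1] * scale + 128,
                    mag * scale], axis=2)
    return np.clip(img, 0, 255).astype(np.uint8)
```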
9. Experimental Setup (1/3)
• Video description datasets
• Microsoft Video Description Corpus (MSVD)
• a collection of YouTube clips & single sentence descriptions from annotators.
• MPII Movie Description Dataset (MPII-MD)
• Hollywood movies & movie scripts and audio description data.
• Montreal Video Annotation Dataset (M-VAD)
• Hollywood movies & audio description data for the visually impaired.
• They used a single sentence as a target sentence for each video.
10. Experimental Setup (2/3)
Table 1. Corpus Statistics
Example of MPII-MD (from "A Dataset for Movie Description", Anna Rohrbach,
Marcus Rohrbach, Niket Tandon, Bernt Schiele, CVPR 2015)
11. Experimental Setup (3/3)
• Evaluation Metrics
• METEOR [7]
• METEOR compares exact token matches, stemmed tokens, paraphrase
matches, as well as semantically similar matches using WordNet synonyms.
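A hedged illustration of METEOR scoring follows, using NLTK's implementation rather than the METEOR tool used in the paper; recent NLTK versions expect pre-tokenized input and need the WordNet data downloaded. The example sentences are made up.

```python
# Illustration only: NLTK's METEOR, not the evaluation code used in the paper.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)               # synonym matching uses WordNet
reference = "a man is playing a guitar".split()
hypothesis = "a man plays the guitar".split()
print(meteor_score([reference], hypothesis))       # partial credit via stems/synonyms
```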
• Experimental details of the models
• unroll the LSTM to a fixed 80 time steps during training.
• for longer videos, truncate the number of frames (sketched below).
• for shorter videos, pad the remaining inputs with zeros.
• mini-batch size: up to 8 for AlexNet, up to 3 for the flow model.
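The small sketch below shows one way to fit each (video, caption) pair into the fixed 80 steps; how frames and words share the 80 steps, and the helper name, are assumptions.

```python
# Fit frame features into the fixed unrolling length (split between frames and words
# is an assumption): truncate long videos, zero-pad short ones.
import numpy as np

MAX_STEPS = 80

def fit_to_steps(frame_feats, num_words):
    # frame_feats: (n_frames, feat_dim); reserve num_words steps for the caption.
    max_frames = MAX_STEPS - num_words
    feats = frame_feats[:max_frames]                     # truncate long videos
    if len(feats) < max_frames:                          # zero-pad short videos
        pad = np.zeros((max_frames - len(feats), frame_feats.shape[1]))
        feats = np.concatenate([feats, pad], axis=0)
    return feats
```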
12. Results and Discussion ‒ MSVD dataset ‒
• The S2VT AlexNet model on RGB video frames achieves 27.9% METEOR.
• The flow-only model performs comparatively poorly.
• Polysemous words, e.g. "playing":
• playing a guitar
• playing golf
13. Results and Discussion ‒ Movie description datasets ‒
• It was best to use dropout at the inputs and outputs of both LSTM
layers (placement sketched below).
• SMT [28]
• translate holistic video
representations to a single sentence.
• Visual-Labels [27]
• an LSTM-based approach that uses no temporal encoding but more diverse
visual features, namely object detectors as well as activity and scene
classifiers.
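A hedged sketch of that dropout placement follows: dropout applied to the inputs and outputs of both LSTM layers of the stack. The 0.5 rate and the step-wise LSTMCell formulation are assumptions.

```python
# One time step of the two-layer stack with dropout around each layer
# (0.5 rate and the per-step formulation are assumptions, not the authors' code).
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)

def stacked_step(x_t, word_emb_t, lstm1_cell, lstm2_cell, state1, state2):
    h1, c1 = lstm1_cell(drop(x_t), state1)            # dropout on layer-1 input
    inp2 = torch.cat([drop(h1), word_emb_t], dim=1)   # dropout on layer-1 output
    h2, c2 = lstm2_cell(drop(inp2), state2)           # dropout on layer-2 input
    return drop(h2), (h1, c1), (h2, c2)               # dropped z_t feeds the softmax
```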
16. Conclusion
• They construct descriptions using a sequence to sequence
model, where frames are first read sequentially and then
words are generated sequentially.
• Their model achieves state-of-the-art performance on the
MSVD dataset.
• For further information...
• https://www.cs.utexas.edu/~vsub/s2vt.html