Course Project
Sequence to Sequence - Video to Text
TensorFlow Implementation (GitHub)
Chun-Ming Chang
twcmchang@gmail.com
2017-05-17
Updated at Jan 18, 2018
Video Caption Generation
Input: a sequence of frames
Output: a sequence of words, e.g., "A woman is slicing a block of tofu"
Methodology
There are two known alternatives:
1. 2-D CNN (visual information) + RNN (temporal information)
○ Extract feature vectors using existing pretrained CNNs (see the sketch
after this list)
○ Encode the frame sequence into a fixed-length vector using an
LSTM-based frame model
○ Decode the video feature vectors into a sentence with another
LSTM-based language model
* A template-based sentence model is not adopted here
2. 3-D CNN: extract spatio-temporal motion features jointly, in conjunction
with an attention mechanism to improve performance
S2VT = Seq-to-Seq: Video to Text
An end-to-end sequence-to-sequence model
Details in S2VT
Encoding stage (no loss is computed in this stage)
1. The top LSTM layer receives the sequence of frame features and encodes them
2. The bottom LSTM layer receives the hidden representation from step 1,
concatenates it with null-padded input tokens (zeros in place of words),
and encodes them
3. Encoding stops once all frames have been consumed
Decoding stage (see the sketch below)
1. Feed the <BOS> tag to start decoding the hidden representations in the
LSTM into a sentence
2. Maximize the log-likelihood of the predicted output sentence, given the
hidden representation and the previously seen word
3. Decoding stops once the <EOS> tag is emitted
BOS: beginning of sentence; EOS: end of sentence
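The sketch below illustrates this two-layer encode/decode layout using the Keras LSTMCell API of current TensorFlow rather than the r1.0 API of the original project; the hidden size, vocabulary size, and the assumption that token id 1 is <BOS> are illustrative only, and decoding runs a fixed number of greedy steps instead of stopping at <EOS>.

import tensorflow as tf

n_frames, max_words = 30, 10               # encoding steps / max decoding steps
feat_dim, hid, vocab = 4800, 256, 3000     # illustrative sizes

top_cell = tf.keras.layers.LSTMCell(hid)   # frame (top) LSTM
bot_cell = tf.keras.layers.LSTMCell(hid)   # language (bottom) LSTM
embed    = tf.keras.layers.Embedding(vocab, hid)
to_vocab = tf.keras.layers.Dense(vocab)

frames = tf.random.normal([1, n_frames, feat_dim])           # dummy CNN features
top_state = [tf.zeros([1, hid]), tf.zeros([1, hid])]
bot_state = [tf.zeros([1, hid]), tf.zeros([1, hid])]
pad = tf.zeros([1, hid])                                     # null-padded word input

# Encoding stage: no loss; the bottom LSTM sees padding instead of words.
for t in range(n_frames):
    h_top, top_state = top_cell(frames[:, t, :], top_state)
    _,     bot_state = bot_cell(tf.concat([h_top, pad], axis=1), bot_state)

# Decoding stage: frame input is zero-padded, start from <BOS> (id 1 assumed),
# greedily emit one word id per step.
word, caption = tf.constant([1]), []
for t in range(max_words):
    h_top, top_state = top_cell(tf.zeros([1, feat_dim]), top_state)
    h_bot, bot_state = bot_cell(tf.concat([h_top, embed(word)], axis=1), bot_state)
    word = tf.argmax(to_vocab(h_bot), axis=1)
    caption.append(int(word[0]))
print(caption)                             # word ids; the model is untrained here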
Strengths of S2VT
● Sequence-to-sequence:
Frames are fed sequentially and words are generated sequentially
○ Allows variable-length input and output
○ Learns the temporal structure of the video
○ Learns a language model that generates grammatically and
semantically correct sentences
● End-to-end:
The encoding and decoding of the frame and word representations
are jointly learned
Evaluation - BLEU@1
BLEU@1 = BP * Precision
Precision = correct words / candidate length
BP = 1 if c > r, otherwise exp(1 - r/c),
where c = candidate length, r = reference length
http://www1.cs.columbia.edu/nlp/sgd/bleu.pdf
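To make the metric concrete, here is a minimal sketch of BLEU@1 as defined above; correct words are counted with the usual clipping (a candidate word counts at most as often as it appears in the reference).

import math
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    c, r = len(cand), len(ref)
    correct = sum((Counter(cand) & Counter(ref)).values())  # clipped unigram matches
    precision = correct / c
    bp = 1.0 if c > r else math.exp(1 - r / c)              # brevity penalty
    return bp * precision

print(bleu1("a woman is slicing tofu",
            "a woman is slicing a block of tofu"))          # ~0.549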
Tip 1: Scheduled Sampling
● Training: uses human-provided words (always correct)
● Testing: uses model-generated words (not always correct)
⇒ Mismatch between training and testing
● Feeding ground-truth words during training is known as "teacher forcing"
● Scheduled sampling: with some probability, feed model-generated words during
training to reduce the teacher-forcing effect (see the sketch below)
https://arxiv.org/abs/1506.03099
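A minimal sketch of the idea: at each decoding step during training, feed the ground-truth word with probability p and the model's previous prediction otherwise, and decay p as training proceeds. The linear decay with a 0.5 floor is an illustrative assumption, not the schedule used in this project.

import random

def next_decoder_input(gt_word, model_word, p_teacher):
    """Word fed into the decoder at the next time step during training."""
    return gt_word if random.random() < p_teacher else model_word

# Illustrative linear decay of the teacher-forcing probability over epochs.
n_epochs = 2000
for epoch in range(n_epochs):
    p_teacher = max(0.5, 1.0 - epoch / n_epochs)   # decays from 1.0 toward 0.5
    # inside the decoding loop one would then call:
    #   word = next_decoder_input(gt_words[t], predicted_word, p_teacher)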
Tip 2: Attention Mechanism
● Learn to weight the frame features non-uniformly
conditioned on previous word inputs
Example caption: "A woman is slicing a block of tofu"
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
http://arxiv.org/abs/1502.03044
https://github.com/yunjey/show-attend-and-tell
● Q: To generate the next word, which part of the video should the model
pay more attention to?
A: Compute the similarity between the current decoder state and all hidden
representations of the encoding LSTM
● Based on these similarities, compute the weighted sum of all encoder
hidden representations and use it as input to the decoding LSTM (see the
sketch after the references below)
Attention Mechanism (Implemented)
https://arxiv.org/pdf/1509.06664.pdf
http://www.aclweb.org/anthology/D15-1166
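A minimal sketch of the weighting step described above, using a plain dot-product score (one of several possible scoring functions; the referenced implementations use learned scoring layers): score every encoder hidden state against the current decoder state, softmax the scores into weights, and feed the weighted sum to the decoding LSTM.

import numpy as np

def attend(decoder_state, encoder_states):
    """decoder_state: (hid,); encoder_states: (n_frames, hid)."""
    scores = encoder_states @ decoder_state       # similarity to each frame's state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over frames
    context = weights @ encoder_states            # weighted sum, shape (hid,)
    return context, weights

h_dec = np.random.rand(256)                       # current decoding-LSTM state
h_enc = np.random.rand(30, 256)                   # 30 encoder hidden states
context, weights = attend(h_dec, h_enc)
print(weights.shape, context.shape)               # (30,) (256,)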
TensorFlow Implementation
● Input: 4800-dimensional video features extracted from a CNN
● Output: a sequence of words
● Environment
○ Python 3.4.2
○ TensorFlow r1.0, numpy-0.9.0, argparse
● Best model
○ S2VT without attention and scheduled sampling*
○ Trained over 2000 epochs (4 hours)
○ Average BLEU score = 0.275
* limited data constrains the effect of attention
Take a look at my TensorFlow implementation of S2VT on my GitHub
Summary
● In this project, I implemented a sequence-to-sequence model to
generate video captions, achieving BLEU@1 = 0.275
● Implemented scheduled sampling to reduce the effect of "teacher
forcing", but observed no performance improvement in my experiments
● Implemented an attention mechanism to better utilize the learned visual
representations, but it also did not bring a significant improvement
● Working on a larger dataset to further study the effectiveness of
scheduled sampling and the attention mechanism
