Course Project
Sequence to Sequence - Video to Text
TensorFlow Implementation (GitHub)
Chun-Ming Chang
twcmchang@gmail.com
2017-05-17
Updated at Jan 18, 2018
Video Caption Generation
Input: a sequence of frames
Output: a sequence of words, e.g., "A woman is slicing a block of tofu"
Methodology
There are two known alternatives:
1. 2-D CNN (visual information) + RNN (temporal information)
○ Extract feature vectors using existing pretrained CNNs (see the sketch
after this list)
○ Encode the frame sequence into a fixed-length vector using an
LSTM-based frame model
○ Decode the video feature vectors into a sentence with another
LSTM-based language model
* A template-based sentence model is not adopted here
2. 3-D CNN: extract spatio-temporal motion features jointly, in conjunction
with an attention mechanism to improve performance
S2VT = Seq-to-Seq: Video to Text
An end-to-end sequence-to-sequence model
Details in S2VT
Encoding stage (no loss is computed in this stage)
1. The top LSTM layer receives the sequence of frame features and encodes them
2. The bottom LSTM layer receives the hidden representation from step 1,
concatenates it with null-padded input tokens (zeros in place of words),
and encodes them
3. Encoding stops once all frames have been consumed
Decoding stage (see the sketch below)
1. Feed the <BOS> tag to start decoding the hidden representations in the
LSTM into a sentence
2. Maximize the log-likelihood of the predicted output sentence, given the
hidden representation and the previously seen word
3. Decoding stops once the <EOS> tag is emitted
BOS: beginning of sentence; EOS: end of sentence
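The sketch below illustrates this two-layer encode/decode layout using the Keras LSTMCell API of current TensorFlow rather than the r1.0 API of the original project; the hidden size, vocabulary size, and the assumption that token id 1 is <BOS> are illustrative only, and decoding runs a fixed number of greedy steps instead of stopping at <EOS>.

import tensorflow as tf

n_frames, max_words = 30, 10               # encoding steps / max decoding steps
feat_dim, hid, vocab = 4800, 256, 3000     # illustrative sizes

top_cell = tf.keras.layers.LSTMCell(hid)   # frame (top) LSTM
bot_cell = tf.keras.layers.LSTMCell(hid)   # language (bottom) LSTM
embed    = tf.keras.layers.Embedding(vocab, hid)
to_vocab = tf.keras.layers.Dense(vocab)

frames = tf.random.normal([1, n_frames, feat_dim])           # dummy CNN features
top_state = [tf.zeros([1, hid]), tf.zeros([1, hid])]
bot_state = [tf.zeros([1, hid]), tf.zeros([1, hid])]
pad = tf.zeros([1, hid])                                     # null-padded word input

# Encoding stage: no loss; the bottom LSTM sees padding instead of words.
for t in range(n_frames):
    h_top, top_state = top_cell(frames[:, t, :], top_state)
    _,     bot_state = bot_cell(tf.concat([h_top, pad], axis=1), bot_state)

# Decoding stage: frame input is zero-padded, start from <BOS> (id 1 assumed),
# greedily emit one word id per step.
word, caption = tf.constant([1]), []
for t in range(max_words):
    h_top, top_state = top_cell(tf.zeros([1, feat_dim]), top_state)
    h_bot, bot_state = bot_cell(tf.concat([h_top, embed(word)], axis=1), bot_state)
    word = tf.argmax(to_vocab(h_bot), axis=1)
    caption.append(int(word[0]))
print(caption)                             # word ids; the model is untrained here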
Strengths of S2VT
● Sequence-to-sequence:
Frames are fed sequentially and words are generated sequentially
○ Allows variable-length input and output
○ Learns the temporal structure of the video
○ Learns a language model that generates grammatically and
semantically correct sentences
● End-to-end:
The encoding and decoding of the frame and word representations
are jointly learned
Evaluation - BLEU@1
BLEU@1 = BP * Precision
Precision = correct words / candidate length
BP = 1 if c > r, otherwise exp(1 - r/c),
where c = candidate length, r = reference length
http://www1.cs.columbia.edu/nlp/sgd/bleu.pdf
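To make the metric concrete, here is a minimal sketch of BLEU@1 as defined above; correct words are counted with the usual clipping (a candidate word counts at most as often as it appears in the reference).

import math
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    c, r = len(cand), len(ref)
    correct = sum((Counter(cand) & Counter(ref)).values())  # clipped unigram matches
    precision = correct / c
    bp = 1.0 if c > r else math.exp(1 - r / c)              # brevity penalty
    return bp * precision

print(bleu1("a woman is slicing tofu",
            "a woman is slicing a block of tofu"))          # ~0.549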
Tip 1: Scheduled Sampling
● Training: uses human-provided words (always correct)
● Testing: uses model-generated words (not always correct)
⇒ Mismatch between training and testing
● Feeding ground-truth words during training is known as "teacher forcing"
● Scheduled sampling: with some probability, feed model-generated words during
training to reduce the teacher-forcing effect (see the sketch below)
https://arxiv.org/abs/1506.03099
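A minimal sketch of the idea: at each decoding step during training, feed the ground-truth word with probability p and the model's previous prediction otherwise, and decay p as training proceeds. The linear decay with a 0.5 floor is an illustrative assumption, not the schedule used in this project.

import random

def next_decoder_input(gt_word, model_word, p_teacher):
    """Word fed into the decoder at the next time step during training."""
    return gt_word if random.random() < p_teacher else model_word

# Illustrative linear decay of the teacher-forcing probability over epochs.
n_epochs = 2000
for epoch in range(n_epochs):
    p_teacher = max(0.5, 1.0 - epoch / n_epochs)   # decays from 1.0 toward 0.5
    # inside the decoding loop one would then call:
    #   word = next_decoder_input(gt_words[t], predicted_word, p_teacher)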
Tip 2: Attention Mechanism
● Learn to weight the frame features non-uniformly
conditioned on previous word inputs
Example caption: "A woman is slicing a block of tofu"
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
http://arxiv.org/abs/1502.03044
https://github.com/yunjey/show-attend-and-tell
● Q: To generate the next word, which part of the video should the model
pay more attention to?
A: Compute the similarity between the current decoder state and all hidden
representations of the encoding LSTM
● Based on these similarities, compute the weighted sum of all encoder
hidden representations and use it as input to the decoding LSTM (see the
sketch after the references below)
Attention Mechanism (Implemented)
https://arxiv.org/pdf/1509.06664.pdf
http://www.aclweb.org/anthology/D15-1166
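A minimal sketch of the weighting step described above, using a plain dot-product score (one of several possible scoring functions; the referenced implementations use learned scoring layers): score every encoder hidden state against the current decoder state, softmax the scores into weights, and feed the weighted sum to the decoding LSTM.

import numpy as np

def attend(decoder_state, encoder_states):
    """decoder_state: (hid,); encoder_states: (n_frames, hid)."""
    scores = encoder_states @ decoder_state       # similarity to each frame's state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over frames
    context = weights @ encoder_states            # weighted sum, shape (hid,)
    return context, weights

h_dec = np.random.rand(256)                       # current decoding-LSTM state
h_enc = np.random.rand(30, 256)                   # 30 encoder hidden states
context, weights = attend(h_dec, h_enc)
print(weights.shape, context.shape)               # (30,) (256,)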
TensorFlow Implementation
● Input: 4800-dimensional video features extracted from a CNN
● Output: a sequence of words
● Environment
○ Python 3.4.2
○ TensorFlow r1.0, numpy-0.9.0, argparse
● Best model
○ S2VT without attention and scheduled sampling*
○ Trained over 2000 epochs (4 hours)
○ Average BLEU score = 0.275
* limited data constrains the effect of attention
Take a look at my TensorFlow implementation of S2VT on my GitHub
Summary
● In this project, I implemented a sequence-to-sequence model to
generate video captions, achieving BLEU@1 = 0.275
● Implemented scheduled sampling to reduce the effect of "teacher
forcing", but observed no performance improvement in my experiments
● Implemented an attention mechanism to better utilize the learned visual
representations, but it also did not bring a significant improvement
● Working on a larger dataset to further study the effectiveness of
scheduled sampling and the attention mechanism
