Course Project
Sequence to Sequence - Video to Text
Tensorflow Implementation (GitHub)
Chun-Ming Chang
twcmchang@gmail.com
2017-05-17
Updated at Jan 18, 2018
Video Caption Generation
Input: a sequence of frames
Output: a sequence of words, e.g., "A woman is slicing a block of tofu"
Methodology
There are two known alternatives:
1. 2-D CNN (visual information) + RNN (temporal information), sketched after this list
○ Extract feature vectors from existing CNNs
○ Encode the frame sequence into a fixed-length vector using an
LSTM-based frame model
○ Decode the video feature vector into a sentence with another
LSTM-based language model
*a template-based sentence model is not adopted here
2. 3-D CNN: extract spatio-temporal motion features at the same
time, in conjunction with an attention mechanism to improve
performance
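
As a concrete illustration of alternative 1, the snippet below extracts per-frame feature vectors with an existing 2-D CNN. InceptionV3 from tf.keras.applications is only an illustrative choice here, not the extractor behind the 4800-dim features used later in this project.

import tensorflow as tf

# Existing 2-D CNN as a frame-level feature extractor (ImageNet weights,
# global-average pooling gives one 2048-d vector per frame).
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')

frames = tf.random.uniform([80, 299, 299, 3])  # one video: 80 RGB frames (dummy data)
frames = tf.keras.applications.inception_v3.preprocess_input(frames * 255.0)
features = cnn(frames, training=False)         # shape [80, 2048], one vector per frame
# These per-frame vectors are what the LSTM-based frame model consumes.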
S2VT = Seq-to-Seq: Video to Text
An end-to-end sequence-to-sequence model
Details in S2VT
Encoding stage (no loss in this stage; both stages are sketched in code below)
1. The top LSTM layer receives the sequence of frames and encodes them
2. The bottom LSTM layer receives the hidden representation from 1,
concatenates it with null-padded input tokens, and encodes them
3. Encoding stops once all frames are exhausted
Decoding stage
1. Feed the <BOS> tag to start decoding the hidden representation
held in the LSTM into a sentence
2. Maximize the log-likelihood of the predicted output sentence, given
the hidden representation and the last seen word
3. Decoding stops once the <EOS> tag is emitted
BOS: beginning of sentence; EOS: end of sentence
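
A minimal sketch of the two-layer structure and the two stages, written against the TF2 Keras API for brevity (the project itself targets TensorFlow r1.0); the layer sizes, the feature projection, and the <BOS> id are illustrative assumptions rather than the project's actual settings.

import tensorflow as tf

BATCH, N_FRAMES, FEAT_DIM = 8, 80, 4800        # per-frame CNN features
MAX_CAPTION, VOCAB, HIDDEN, EMBED = 20, 3000, 256, 256

frame_fc    = tf.keras.layers.Dense(HIDDEN)     # project 4800-d features
word_embed  = tf.keras.layers.Embedding(VOCAB, EMBED)
lstm_top    = tf.keras.layers.LSTMCell(HIDDEN)  # top "frame" LSTM
lstm_bottom = tf.keras.layers.LSTMCell(HIDDEN)  # bottom "language" LSTM
to_logits   = tf.keras.layers.Dense(VOCAB)

def s2vt_step(frame_in, word_in, states):
    # One time step: the top LSTM sees the frame slot; the bottom LSTM sees
    # [top hidden representation ; word slot].
    top_state, bottom_state = states
    top_out, top_state = lstm_top(frame_in, top_state)
    bottom_in = tf.concat([top_out, word_in], axis=-1)
    bottom_out, bottom_state = lstm_bottom(bottom_in, bottom_state)
    return bottom_out, (top_state, bottom_state)

def zero_state():
    return [tf.zeros([BATCH, HIDDEN]), tf.zeros([BATCH, HIDDEN])]

frames = tf.random.normal([BATCH, N_FRAMES, FEAT_DIM])   # dummy video features
states = (zero_state(), zero_state())

# Encoding stage: feed the frames; the word slot is null-padded, no loss here.
pad_word = tf.zeros([BATCH, EMBED])
for t in range(N_FRAMES):
    _, states = s2vt_step(frame_fc(frames[:, t]), pad_word, states)

# Decoding stage: the frame slot is null-padded; start from <BOS>,
# maximize the log-likelihood of each target word, stop at <EOS>.
pad_frame = tf.zeros([BATCH, HIDDEN])
word = tf.fill([BATCH], 1)                      # assume word id 1 = <BOS>
for t in range(MAX_CAPTION):
    out, states = s2vt_step(pad_frame, word_embed(word), states)
    logits = to_logits(out)                     # cross-entropy loss goes here
    word = tf.argmax(logits, axis=-1)           # greedy choice (test-time behaviour)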
Strengths of S2VT
● Sequence-to-sequence:
Frames are fed sequentially and words are generated sequentially
○ Allows variable-length input and output
○ Learns the temporal structure of the video
○ Learns a language model that generates grammatically and
semantically correct sentences
● End-to-end:
The encoding and decoding of the frame and word representations
are learned jointly
Evaluation - BLEU@1
BLEU@1 = BP * Precision
Precision = correct words / candidate length
BP (brevity penalty) = 1 if c > r, else exp(1 - r/c),
where c = candidate length, r = reference length
http://www1.cs.columbia.edu/nlp/sgd/bleu.pdf
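
A worked example as a plain-Python sketch of the standard BLEU@1 definition (the project's actual evaluation script may differ in tokenization and in how multiple references are handled):

from collections import Counter
import math

def bleu_1(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    c, r = len(cand), len(ref)
    # Unigram precision with clipping: a candidate word counts as "correct"
    # at most as many times as it appears in the reference.
    ref_counts = Counter(ref)
    correct = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    precision = correct / c if c else 0.0
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * precision

print(bleu_1("a woman is slicing tofu",
             "a woman is slicing a block of tofu"))   # 1.0 * exp(1 - 8/5) ≈ 0.549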
Tip 1: Scheduled Sampling
● Training: use human-provided words (always correct)
● Testing: use model-generated words (probably correct)
⇒ Mismatch between training and testing
● Feeding the ground-truth previous word during training is known as
"teacher forcing"
● Scheduled sampling: with some probability, use model-generated words
during training to reduce the teacher-forcing effect (see the sketch below)
https://arxiv.org/abs/1506.03099
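
A minimal sketch of the idea, assuming a step-by-step decoder loop like the S2VT sketch earlier; the exponential decay schedule (epsilon_i = k^i) is one of the options in the paper above, and the value of k is illustrative rather than the project's setting.

import random

def teacher_forcing_prob(epoch, k=0.99):
    """Probability of feeding the ground-truth previous word at this epoch."""
    return k ** epoch          # decays from 1 toward 0 as training proceeds

def choose_previous_word(ground_truth_word, model_word, epoch):
    """Flip a coin each decoding step: ground truth vs. the model's own output."""
    if random.random() < teacher_forcing_prob(epoch):
        return ground_truth_word      # teacher forcing
    return model_word                 # model-generated word (as at test time)

# usage inside the training loop, per decoding step t:
# word_t = choose_previous_word(caption[t - 1], predicted[t - 1], epoch)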
Tip 2: Attention Mechanism
● Learn to weight the frame features non-uniformly,
conditioned on the previous word inputs
Example caption: "A woman is slicing a block of tofu"
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
http://arxiv.org/abs/1502.03044
https://github.com/yunjey/show-attend-and-tell
● Q: To generate the next word, which parts of the video should the
model pay more attention to?
A: Compute the similarity between the current decoding state and all
hidden representations in the encoding LSTM
● Based on these similarities, compute a weighted sum of all encoder
hidden representations and feed it as input to the decoding LSTM
(see the sketch below)
Attention Mechanism (Implemented)
https://arxiv.org/pdf/1509.06664.pdf
http://www.aclweb.org/anthology/D15-1166
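
A minimal sketch of this step using simple dot-product scoring (the papers linked above learn an alignment model, so this is a simplification); the tensor shapes are assumptions matching the S2VT sketch earlier.

import tensorflow as tf

def attend(decoder_state, encoder_states):
    # decoder_state:  [batch, hidden]       - current decoding LSTM state
    # encoder_states: [batch, time, hidden] - all hidden states from encoding
    scores = tf.einsum('bh,bth->bt', decoder_state, encoder_states)  # similarities
    weights = tf.nn.softmax(scores, axis=-1)        # non-uniform weights over frames
    context = tf.einsum('bt,bth->bh', weights, encoder_states)       # weighted sum
    return context, weights

# The context vector is concatenated with the decoder input at each decoding
# step, replacing a single fixed video representation.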
Tensorflow Implementation
● Input: 4800-dim video features extracted from a CNN
● Output: a sequence of words
● Environment
○ Python 3.4.2
○ Tensorflow r1.0, numpy-0.9.0, argparse
● Best model
○ S2VT without attention and scheduled sampling*
○ Trained over 2000 epochs (4 hours)
○ Average BLEU score = 0.275
* limited data constrains the effect of attention
Take a look at my Tensorflow implementation of S2VT on my GitHub
Summary
● In this project, I implemented a sequence-to-sequence model to
generate video captions, achieving BLEU@1 = 0.275
● Implemented scheduled sampling to reduce the effect of "teacher
forcing", but observed no performance improvement in my experiments
● Implemented an attention mechanism to better utilize the learned visual
representations, but it also did not bring a significant improvement
● Working on a larger dataset to further study the effectiveness of
scheduled sampling and the attention mechanism
