Sequence to Sequence ‒
Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue,
Raymond Mooney, Trevor Darrell, Kate Saenko
ICCV 2015
M2 Soichiro Murakami
10/14/16 1
Introduction
Video
Text
A monkey is pulling a dog’s tail and is chased by the dog.
Main contribution
• To propose a novel model, which learns to directly
map a sequence of frames to a sequence of words
General seq2seq model
a. handle a variable number of frames
b. learn and use the temporal structure
of the video
c. learn a language model to generate
natural and grammatical sentences.
Fig.1
Related work 1/2
• image caption [8, 40]
1. generate a fixed length vector representation of an image
2. decode this vector into a sequence of words
• FGM [36]
1. identify the semantic content (subject, verb, object, scene).
2. combine them with confidences from a language model using a
factor graph to infer the most likely tuple in the video.
3. generate a sentence based on a template.
• Mean Pool [39]
• LSTMs are used to generate video descriptions by pooling the
representations of individual frames.
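The pooling step in Mean Pool [39] can be sketched as below. This is a minimal illustration, not the authors' code: the function name `mean_pool` and the toy 2-dimensional features are assumptions (real frame features are CNN outputs with thousands of dimensions). It also shows why pooling loses temporal structure: the same frames in reverse order give the same video vector.

```python
from typing import List


def mean_pool(frame_features: List[List[float]]) -> List[float]:
    """Average per-frame feature vectors into one fixed-length video vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


# Toy frames (hypothetical values); the second video plays them backwards.
forward = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
backward = list(reversed(forward))

pooled_forward = mean_pool(forward)
pooled_backward = mean_pool(backward)
# Both orderings pool to the same vector, so frame order is not recoverable.
print(pooled_forward == pooled_backward)
```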
Related work 2/2
• Temporal-Attention [43] (ICCV2015)
• employ a 3-D convnet model that incorporates spatiotemporal
motion features to extract dense trajectory features (HoG, HoF, MBH).
• use an attention mechanism that learns to weight the frame
features.
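Weighting frame features with attention can be sketched as a softmax over per-frame scores followed by a weighted sum. This is only an illustration: in [43] the scores are learned from the decoder state and the frame features, whereas here `scores` is passed in by hand, and the names `softmax` and `attend` are assumptions.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame_features: List[List[float]],
           scores: List[float]) -> Tuple[List[float], List[float]]:
    """Return (weights, context), where context is the weighted sum of frames."""
    weights = softmax(scores)
    dim = len(frame_features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


frames = [[1.0, 0.0], [0.0, 1.0]]          # toy frame features (hypothetical)
weights, context = attend(frames, [2.0, 0.0])
print(weights, context)                     # the first frame dominates
```

With equal scores the weights become uniform and the context reduces to mean pooling.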
Approach 1/2
• 3.1 LSTM for sequence modeling
• 3.2 Sequence to sequence video to text
p(y1, ..., ym | x1, ..., xn)
x1, ..., xn: seq. of video frames; y1, ..., ym: seq. of words
Fig. 2
concatenate
Zt: output of the second LSTM layer
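The conditional probability above can be written out step by step. This is a sketch that follows the paper's Section 3 notation; the symbols h (LSTM hidden state), W_y (output weights for word y) and V (vocabulary) are not on the slide. After the n frames are read, each word is predicted from the previous hidden state and the previous word, and the output z_t of the second LSTM layer is passed through a softmax over the vocabulary:

```latex
p(y_1, \ldots, y_m \mid x_1, \ldots, x_n)
  = \prod_{t=1}^{m} p(y_t \mid h_{n+t-1}, y_{t-1})

p(y \mid z_t) = \frac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)}
```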
Approach 2/2
• 3.3 Video and text representation
• RGB frames
• apply a CNN (pre-trained) to input images and provide the output of
the top layer as input to the LSTM units. (AlexNet, 16-layer VGG
model)
• Optical Flow
• first extract classical variational optical flow features[2].
• then create flow images and apply a CNN (pre-trained).
• Text
• embed words to a lower 500 dimensional space by applying a linear
transformation to the input data.
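The word embedding can be sketched as a one-hot vector multiplied by a weight matrix, which amounts to selecting one row of that matrix. This is a minimal illustration: the toy vocabulary, the random weights, and the names `one_hot` and `embed` are assumptions; only the 500-dimensional target space comes from the slide.

```python
import random
from typing import List

VOCAB = ["<BOS>", "<EOS>", "a", "man", "is", "playing", "guitar"]  # toy vocabulary (assumption)
EMBED_DIM = 500  # embedding size stated on the slide

random.seed(0)
# One row per word; in a trained model these values are learned, here they are random.
W = [[random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for _ in VOCAB]


def one_hot(index: int, size: int) -> List[float]:
    """1-of-N encoding of a word index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(word: str) -> List[float]:
    """Apply the linear map x^T W to the one-hot encoding x of `word`."""
    x = one_hot(VOCAB.index(word), len(VOCAB))
    return [sum(x[i] * W[i][d] for i in range(len(VOCAB))) for d in range(EMBED_DIM)]


vec = embed("guitar")
print(len(vec))  # length of the embedded vector
```

Because the input is one-hot, practical implementations skip the multiplication and index the row directly.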
for the combined model.
Experimental Setup (1/3)
• Video description datasets
• Microsoft Video Description Corpus (MSVD)
• a collection of YouTube clips & single sentence descriptions from annotators.
• MPII Movie Description Dataset (MPII-MD)
• Hollywood movies & movie scripts and audio description data.
• Montreal Video Annotation Dataset (M-VAD)
• Hollywood movies & audio description data for the visually impaired.
• They used a single sentence as a target sentence for each video.
Experimental Setup (2/3)
Table 1. Corpus Statistics
Example of MPII-MD
( A Dataset for Movie Description, Anna Rohrbach, Marcus
Rohrbach, Niket Tandon, Bernt Schiele, CVPR 2015)
Experimental Setup (3/3)
• Evaluation Metrics
• METEOR [7]
• METEOR compares exact token matches, stemmed tokens, paraphrase
matches, as well as semantically similar matches using WordNet synonyms.
• Experimental details of the models
• unroll the LSTM to a fixed 80 time steps during training.
• for longer videos, truncate the number of frames.
• for shorter videos, pad the remaining inputs with zeros.
• mini-batch size: up to 8 for AlexNet, up to 3 for flow model.
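The truncate-or-pad rule above can be sketched as follows. This is a simplified illustration: the slide does not say how the 80 time steps are divided between frames and words, so `length` here is a hypothetical frame budget, and `fit_to_length` is an assumed name.

```python
from typing import List

MAX_STEPS = 80  # fixed unroll length stated on the slide


def fit_to_length(frames: List[List[float]],
                  length: int = MAX_STEPS) -> List[List[float]]:
    """Truncate long videos and zero-pad short ones to exactly `length` steps."""
    dim = len(frames[0]) if frames else 0
    clipped = frames[:length]                      # drop frames beyond the budget
    padding = [[0.0] * dim for _ in range(length - len(clipped))]
    return clipped + padding                       # zeros fill the remaining steps


short_video = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # 3 toy frames
long_video = [[float(i)] for i in range(100)]            # 100 toy frames

print(len(fit_to_length(short_video, 5)), len(fit_to_length(long_video)))
```

A fixed length keeps every sequence in a mini-batch the same shape, at the cost of discarding the tail of long videos.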
Results and Discussion ‒ MSVD dataset -
• S2VT AlexNet model on RGB video
frames achieves 27.9% METEOR.
• The flow model performs worse than the
RGB model.
• One reason: polysemous verbs such as "play"
cover very different motions
• playing a guitar
• playing golf
Results and Discussion ‒ Movie description datasets -
• It was best to use dropout at the
inputs and outputs of both LSTM
layers.
• SMT [28]
• translate holistic video
representations to a single sentence.
• Visual-Labels [27]
• LSTM-based approach which uses no
temporal encoding, but more diverse
visual features, namely object
detectors, as well as activity and
scene classifiers.
Conclusion
• They construct descriptions using a sequence to sequence
model, where frames are first read sequentially and then
words are generated sequentially.
• Their model achieves state-of-the-art performance on the
MSVD dataset.
• For further information...
• https://www.cs.utexas.edu/~vsub/s2vt.html

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Paper introduction: Sequence to Sequence - Video to Text (ICCV2015)

  • 1. Sequence to Sequence ‒ Video to Text Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko ICCV 2015 M2 Soichiro Murakami 10/14/16 1
  • 3. A monkey is pulling a dog’s tail and is chased by the dog.
  • 4. Main contribution • To propose a novel model that learns to directly map a sequence of frames to a sequence of words. General seq2seq model: a. handle a variable number of frames b. learn and use the temporal structure of the video c. learn a language model to generate natural and grammatical sentences. Fig. 1
  • 5. Related work 1/2 • image caption [8, 40] 1. generate a fixed-length vector representation of an image 2. decode this vector into a sequence of words • FGM [36] 1. identify the semantic content (subject, verb, object, scene). 2. combine them with confidences from a language model using a factor graph to infer the most likely tuple in the video. 3. generate a sentence based on a template. • Mean Pool [39] • LSTMs are used to generate video descriptions by pooling the representations of individual frames.
  • 6. Related work 2/2 • Temporal-Attention [43] (ICCV2015) • employ a 3-D convnet model that incorporates spatiotemporal motion features to extract dense trajectory features (HoG, HoF, MBH). • use an attention mechanism that learns to weight the frame features.
  • 7. Approach 1/2 • 3.1 LSTM for sequence modeling • 3.2 Sequence to sequence video to text • p(y1, ..., ym | x1, ..., xn), where (x1, ..., xn) is the seq. of video frames and (y1, ..., ym) is the seq. of words • Fig. 2 (annotations: concatenate; Zt: output of the second LSTM layer)
  • 8. Approach 2/2 • 3.3 Video and text representation • RGB frames • apply a pre-trained CNN to the input images and provide the output of the top layer as input to the LSTM units. (AlexNet, 16-layer VGG model) • Optical Flow • first extract classical variational optical flow features [2]. • then create flow images and apply a pre-trained CNN. • Text • embed words into a lower, 500-dimensional space by applying a linear transformation to the input data. for the combined model.
  • 9. Experimental Setup (1/3) • Video description datasets • Microsoft Video Description Corpus (MSVD) • a collection of YouTube clips & single sentence descriptions from annotators. • MPII Movie Description Dataset (MPII-MD) • Hollywood movies & movie scripts and audio description data. • Montreal Video Annotation Dataset (M-VAD) • Hollywood movies & audio description data for the visually impaired. • They used a single sentence as the target sentence for each video.
  • 10. Experimental Setup (2/3) • Table 1. Corpus Statistics • Example of MPII-MD (A Dataset for Movie Description, Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele, CVPR 2015)
  • 11. Experimental Setup (3/3) • Evaluation Metrics • METEOR [7] • METEOR compares exact token matches, stemmed tokens, paraphrase matches, as well as semantically similar matches using WordNet synonyms. • Experimental details of the models • unroll the LSTM to a fixed 80 time steps during training. • for longer videos, truncate the number of frames. • for shorter videos, pad the remaining inputs with zeros. • mini-batch size: up to 8 for AlexNet, up to 3 for flow model.
  • 12. Results and Discussion ‒ MSVD dataset ‒ • S2VT AlexNet model on RGB video frames achieves 27.9% METEOR. • The low performance of the flow model. • Polysemous words • playing a guitar • playing golf
  • 13. Results and Discussion ‒ Movie description datasets ‒ • It was best to use dropout at the inputs and outputs of both LSTM layers. • SMT [28] • translate holistic video representations to a single sentence. • Visual-Labels [27] • LSTM-based approach which uses no temporal encoding, but more diverse visual features, namely object detectors, as well as activity and scene classifiers.
  • 16. Conclusion • They construct descriptions using a sequence to sequence model, where frames are first read sequentially and then words are generated sequentially. • Their model achieves state-of-the-art performance on the MSVD dataset. • For further information... • https://www.cs.utexas.edu/~vsub/s2vt.html
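
The truncation and zero-padding of frame inputs described under Experimental Setup (3/3) can be illustrated with a minimal sketch. This is not the authors' code. The slides state only that the LSTM is unrolled to a fixed 80 time steps; how those steps are divided between frames and words is not given here, so the function name `fit_frames` and the parameter `max_frames` are hypothetical and chosen only for illustration.

```python
from typing import List


def fit_frames(frames: List[List[float]], max_frames: int) -> List[List[float]]:
    """Return exactly max_frames feature vectors.

    Longer videos are truncated; shorter videos are padded with zero vectors.
    max_frames is a hypothetical frame budget, not a value from the slides.
    """
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    if not frames:
        raise ValueError("at least one frame is needed to infer the feature size")

    dim = len(frames[0])
    # Keep only the first max_frames frames (truncation for long videos).
    clipped = [list(f) for f in frames[:max_frames]]
    # Fill the remaining time steps with zeros (padding for short videos).
    padding = [[0.0] * dim for _ in range(max_frames - len(clipped))]
    return clipped + padding


if __name__ == "__main__":
    short_video = [[0.5, 0.1], [0.2, 0.9]]
    print(fit_frames(short_video, 4))
```

Padding on the right keeps the real frames at the start of the unrolled sequence, so every video in a mini-batch has the same length regardless of its original frame count.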