Michael Gygli
ETH-CVL @ MediaEval 2016: Textual-Visual Embeddings and Video2GIF for Video Interestingness. In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, CEUR-WS.org (2016). By Arun B. Vasudevan, Michael Gygli, Anna Volokitin, Luc Van Gool.
Paper: http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_24.pdf
Video: https://youtu.be/8qe-NIPSD-4
Abstract: This paper presents the methods that underlie our submission to the Predicting Media Interestingness Task at MediaEval 2016. Our contribution relies on two main approaches: (i) a similarity metric between image and text and (ii) a generic video highlight detector. In particular, we develop a method for learning the similarity of text and images by projecting them into the same embedding space. This embedding allows us to find video frames that are both canonical and relevant w.r.t. the title of the video. We present the results of different configurations and give insights into when our best-performing method works well and where it has difficulties.
MediaEval 2016 - ETH-CVL: Textual-Visual Embeddings and Video2GIF for Video Interestingness
1. ETH-CVL @ MediaEval 2016: Textual-Visual
Embeddings and Video2GIF for Video Interestingness
Michael Gygli, PhD student @ Computer Vision Lab, ETH Zurich
Work with Arun Balajee Vasudevan, Anna Volokitin and Luc Van Gool
3. Finding the most interesting and relevant content
▪ Two dominant approaches
▪ Use of textual information (title, category) to obtain a video specific
model, often using web image priors
E.g. [Khosla et al. CVPR 13], [Liu et al. IJCAI 09], [Song et al. CVPR 15]
▪ Supervised learning with generic features and large training set
E.g. [Potapov et al. ECCV 2014], [Sun et al. ECCV 2014], [Gygli et al.
CVPR 16]
▪ Some methods use both
E.g. [Liu et al. CVPR 15]
5. Multi-task deep visual-semantic embedding for video
thumbnail selection [Liu et al. CVPR 15]
▪ Use Bing image search data (query, image, # of clicks) to learn a
joint embedding space for images and text
▪ Compute frame relevance as cosine similarity between the query or
title embedding and the frame embedding
[Liu et al. CVPR 2015]
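The relevance scoring described above can be sketched as follows. The embedding vectors here are random stand-ins; in the actual system they come from the learned text and image models.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_frames_by_relevance(title_emb, frame_embs):
    """Return frame indices sorted by similarity to the title, best first."""
    scores = [cosine_similarity(title_emb, f) for f in frame_embs]
    return sorted(range(len(frame_embs)), key=lambda i: -scores[i]), scores

rng = np.random.default_rng(0)
title_emb = rng.normal(size=300)        # e.g. a 300-d text embedding
frame_embs = rng.normal(size=(5, 300))  # embeddings of 5 video frames
ranking, scores = rank_frames_by_relevance(title_emb, frame_embs)
print(ranking)  # frame indices, most relevant first
```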
Michael Gygli, PhD student @ ETH Zurich, 06/21/2016
6. Our contribution with improvements over Liu et al.
● Siamese network
● Text Embedding Model
○ Words encoded through word2vec
○ LSTM to obtain fixed length embedding
● Convolutional Neural Network model
○ Fine-tuned VGG-19
● Training data based on learning a ranking over triplets (query, image+, image-),
e.g. the query “cat” with a clicked image (image+) and a non-clicked image (image-)
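The training signal over such triplets can be sketched as a hinge-style ranking loss on cosine similarities: the clicked image should score higher for the query than the non-clicked one. The margin value and this exact loss form are assumptions for illustration; the submitted system may use a different formulation.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_ranking_loss(q, img_pos, img_neg, margin=0.1):
    """Hinge loss: the positive image should score higher than the
    negative image for the query, by at least `margin`."""
    return max(0.0, margin - cos(q, img_pos) + cos(q, img_neg))

# Toy 4-d embeddings: the positive is aligned with the query,
# the negative is orthogonal, so the loss is zero.
q   = np.array([1.0, 0.0, 0.0, 0.0])
pos = np.array([0.9, 0.1, 0.0, 0.0])
neg = np.array([0.0, 0.0, 1.0, 0.0])
print(triplet_ranking_loss(q, pos, neg))  # 0.0
```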
8. Video2GIF: Automatic Generation of Animated GIFs
from Video [Gygli et al. CVPR 16]
Approach
▪ Work with segments as units
▪ Obtained through change-point detection [Song et al. CVPR 15]
▪ Train a deep neural network for ranking segments
...
[Figure: example video with its highest- and lowest-scoring segments]
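The segmentation step above relies on change-point detection [Song et al. CVPR 15], which is a learned method; as a much simpler stand-in, a video can be split wherever consecutive frame features differ strongly. The feature values and threshold below are illustrative only.

```python
import numpy as np

def segment_by_change_points(frame_feats, threshold=1.0):
    """Split a video into segments at points where consecutive frame
    features differ strongly. A simplified stand-in for the learned
    change-point detection of [Song et al. CVPR 15]."""
    diffs = np.linalg.norm(np.diff(frame_feats, axis=0), axis=1)
    boundaries = ([0]
                  + [i + 1 for i, d in enumerate(diffs) if d > threshold]
                  + [len(frame_feats)])
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]

# Toy 1-d features: a clear jump between frames 2 and 3 yields two segments.
feats = np.array([[0.0], [0.1], [0.05], [5.0], [5.1], [5.05]])
print(segment_by_change_points(feats))  # [(0, 3), (3, 6)]
```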
9. Video2GIF: Automatic Generation of Animated GIFs
from Video
Approach
▪ Train a deep neural network for ranking
segments
▪ Built on C3D network [Tran et al. ICCV
2015]
▪ Objective: score positives higher than negatives, i.e. h(s+) > h(s-),
where h is the scoring function, s+ a positive segment and s- a negative segment
10. Video2GIF: Dataset
Available on github.com/gyglim/video2gif_dataset
▪ Large-scale training data: GIFs created from YouTube videos
▪ Align GIF back to video
▪ This part defines a positive example
▪ Assume non-selected parts are less interesting than selected part
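The labeling scheme above can be sketched as follows: once a GIF is aligned back to its source video, any segment overlapping the aligned span is a positive example and the remaining segments are negatives. The interval representation is an assumption for illustration.

```python
def label_segments(segments, gif_span):
    """Label video segments as positive if they overlap the aligned GIF,
    negative otherwise. Segments and gif_span are (start, end) in seconds."""
    def overlaps(seg):
        return seg[0] < gif_span[1] and gif_span[0] < seg[1]
    return [(seg, "positive" if overlaps(seg) else "negative") for seg in segments]

segments = [(0, 5), (5, 10), (10, 15)]
gif_span = (6, 9)  # where the user-created GIF aligns in the source video
print(label_segments(segments, gif_span))
# [((0, 5), 'negative'), ((5, 10), 'positive'), ((10, 15), 'negative')]
```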
11. Results
Task    Run    mAP
Image   1      0.1866
Image   2      0.1952
Image   3      0.1858
Video   1      0.1362
Video   2      0.1574
Frame-based:
• Run 1: Visual-semantic embedding trained on Clickture dataset
• Run 2: As Run 1, with fine-tuning on development set
• Run 3: As Run 1, but trained on a larger subset of Clickture
Segment-based:
• Run 1: Video2GIF
• Run 2: Averaged score of Visual-semantic embedding and Video2GIF
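The score fusion in the segment-based Run 2 can be sketched as below. The run description only states that the two models' scores are averaged; the min-max normalisation step, which brings the two score ranges onto a common scale first, is an assumption.

```python
import numpy as np

def fuse_scores(emb_scores, gif_scores):
    """Average per-segment scores from the visual-semantic embedding and
    Video2GIF after min-max normalisation of each score list."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min())
    return (norm(emb_scores) + norm(gif_scores)) / 2.0

emb = [0.2, 0.8, 0.5]   # visual-semantic embedding similarities
gif = [-1.0, 2.0, 0.5]  # Video2GIF segment scores
print(fuse_scores(emb, gif))  # [0.  1.  0.5]
```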
12. Qualitative Results - Run 2
[Figure: predicted best frame vs. true best frame for the videos titled "Captives" and "After Earth"]
13. More information
▪ Paper Video2GIF: Automatic Generation of Animated GIFs from Video.
M. Gygli, Y. Song, L. Cao, CVPR 2016
▪ Slides on Video Summarization as Subset Selection @ Tutorial on
Optimization Algorithms for Subset Selection and Summarization in Large
Data Sets, CVPR 2016
https://t.co/mQIpxMab3v
▪ Demo website for Video2GIF: http://video2gif.info/autogif
▪ Paper Multi-task deep visual-semantic embedding for video thumbnail
selection. W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo, CVPR 2015