Combining Adversarial and Reinforcement Learning for
Video Thumbnail Selection
E. Apostolidis1,2, E. Adamantidou1, V. Mezaris1, I. Patras2
1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
2021 ACM International Conference
on Multimedia Retrieval
Outline
• Problem statement
• Related work
• Developed approach
• Experiments
• Conclusions
Problem statement
Video is everywhere!
• Captured by smart devices and instantly
shared online
• Constantly and rapidly increasing volumes
of video content on the Web
Hours of video content uploaded to YouTube every minute
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-sharing-apps-
like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
Problem statement
But how can we spot what we are looking for in endless collections of video content?
Get a quick idea about a
video’s content by
checking its thumbnail!
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
Goal of video thumbnail selection technologies
Analysis outcomes: a set of
representative video frames
“Select one or a few video frames
that provide a representative and
aesthetically-pleasing overview of
the video content”
Video title: “Susan Boyle's First Audition - I Dreamed a Dream - Britain's Got Talent 2009”
Video source: OVP dataset (video also available online at: https://www.youtube.com/watch?v=deRF9oEbRso)
Related work
Early visual-based approaches: Use of hand-crafted rules about the optimal thumbnail, and
tailored features and mechanisms to assess video frames’ alignment with these rules
• Thumbnail selection associated with: appearance and positioning of faces/objects, color diversity,
variance of luminance, scene steadiness, thematic relevance, absence of subtitles
• Main shortcoming: rule definition and feature engineering are highly complex tasks
Recent visual-based approaches: Target a few commonly-desired characteristics for a video
thumbnail, and exploit the learning efficiency of deep network architectures
• Thumbnail selection associated with: learnable estimates about frames’ representativeness and
aesthetic quality (focusing also on faces), learnable classifications of good and bad frames
Recent multimodal approaches: Exploit data from additional modalities or auxiliary sources
• Video thumbnail selection is associated with: extracted keywords from the video metadata, databases
with visually-similar content, latent representations of textual and audio data, textual user queries
Developed approach
High-level overview
Developed approach
Network architecture
• Thumbnail Selector
• Estimating frames’ aesthetic quality
• Estimating frames’ importance
• Fusing estimations and selecting a small set of candidate thumbnails (see the fusion sketch after this list)
• Thumbnail Evaluator
• Evaluating thumbnails’ aesthetic quality
• Evaluating thumbnails’ representativeness
• Fusing evaluations (rewards)
• Thumbnail Evaluator → Thumbnail Selector
• Using the overall reward for reinforcement learning
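To make the fusion step concrete, here is a minimal PyTorch sketch of how per-frame aesthetics and importance estimations could be fused and the top-scoring frames kept as candidate thumbnails. The product-based fusion, the number of candidates, and all names are illustrative assumptions, not the exact implementation of the Thumbnail Selector.

```python
import torch

def select_candidate_thumbnails(aesthetics, importance, num_candidates=3):
    # Fuse the two per-frame estimations; the product favors frames that
    # score well on BOTH criteria (an assumed fusion scheme)
    fused = aesthetics * importance
    # Keep the highest-scoring frames as candidate thumbnails
    scores, indices = torch.topk(fused, k=num_candidates)
    return indices, scores

# Toy usage: 10 frames with random per-frame scores in [0, 1]
aesthetics = torch.rand(10)
importance = torch.rand(10)
candidates, scores = select_candidate_thumbnails(aesthetics, importance)
print(candidates, scores)
```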
Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 1: Update Encoder based on:
LRecon: “distance between original and reconstructed
feature vectors, based on a latent representation in
the last hidden layer of the Discriminator”
LPrior: “information loss when using the Encoder’s latent
space to represent the prior distribution defined by
the Variational Auto-Encoder”
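A minimal PyTorch sketch of these Step-1 losses, assuming a standard-normal VAE prior and a mean-squared feature-matching distance; all variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def encoder_loss(h_orig, h_recon, mu, log_var):
    # L_Recon: distance between the latent representations of the original
    # and the thumbnail-based reconstructed feature vectors, both taken
    # from the last hidden layer of the Discriminator
    l_recon = F.mse_loss(h_recon, h_orig)
    # L_Prior: KL divergence between the Encoder's latent Gaussian
    # N(mu, sigma^2) and an assumed standard-normal prior
    l_prior = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return l_recon + l_prior
```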
• Step 2: Update Decoder based on:
LRecon: “distance between original and reconstructed
feature vectors, based on a latent representation in the
last hidden layer of the Discriminator”
LGEN: “difference between the Discriminator’s output when
seeing the thumbnail-based reconstructed feature vectors
and the label (“1”) associated with the original video”
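A corresponding sketch of the Step-2 losses, assuming binary cross-entropy as the measure of “difference” between the Discriminator’s output and the target label:

```python
import torch
import torch.nn.functional as F

def decoder_loss(h_orig, h_recon, d_out_recon):
    # L_Recon: same feature-matching distance as in Step 1
    l_recon = F.mse_loss(h_recon, h_orig)
    # L_GEN: push the Discriminator to label the thumbnail-based
    # reconstruction as the original video ("1")
    l_gen = F.binary_cross_entropy(d_out_recon, torch.ones_like(d_out_recon))
    return l_recon + l_gen
```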
• Step 3: Update Discriminator based on:
LORIG: “difference between the Discriminator’s output when
seeing the original feature vectors and the label (“1”)
associated with the original video”
LSUM: “difference between the Discriminator’s output when
seeing the thumbnail-based reconstructed feature
vectors and the label (“0”) associated with the thumbnail-
based video summary”
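The Step-3 loss under the same binary cross-entropy assumption:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_out_orig, d_out_recon):
    # L_ORIG: original feature vectors should be classified as "1"
    l_orig = F.binary_cross_entropy(d_out_orig, torch.ones_like(d_out_orig))
    # L_SUM: thumbnail-based reconstructed feature vectors should be
    # classified as "0" (thumbnail-based video summary)
    l_sum = F.binary_cross_entropy(d_out_recon, torch.zeros_like(d_out_recon))
    return l_orig + l_sum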
• Step 4: Update Importance Estimator based
on the Episodic REINFORCE algorithm
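A minimal sketch of one such update, assuming a Bernoulli pick/skip action per frame and a moving-average reward baseline; the episode count and all names are illustrative assumptions, not the repository’s exact code:

```python
import torch
from torch.distributions import Bernoulli

def reinforce_update(frame_probs, evaluate, baseline, optimizer, episodes=5):
    # frame_probs: (num_frames,) picking probabilities from the Importance
    # Estimator; evaluate(actions) returns the Thumbnail Evaluator's overall
    # reward for a sampled selection; the baseline reduces gradient variance
    dist = Bernoulli(probs=frame_probs)
    loss = 0.0
    for _ in range(episodes):
        actions = dist.sample()                        # pick ("1") / skip ("0")
        reward = evaluate(actions)                     # scalar episode reward
        log_prob = dist.log_prob(actions).sum()
        loss = loss - (reward - baseline) * log_prob   # REINFORCE objective
    loss = loss / episodes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```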
Experiments
Datasets
• Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
• 50 videos of various genres (e.g. documentary, educational, historical, lecture)
• Video length: 46 sec. to 3.5 min.
• Annotation: keyframe-based video summaries (5 per video)
• YouTube (https://sites.google.com/site/vsummsite/download)
• 50 videos of diverse content (e.g. news, TV shows, sports, commercials) collected from the Web
• Video length: 9 sec. to 11 min.
• Annotation: keyframe-based video summaries (5 per video)
Experiments
Evaluation approach
• Ground-truth thumbnails: the top-3 keyframes selected by human annotators
• Evaluation measures (see the sketch after this list):
• Precision at 1 (P@1): matching the ground-truth with the top-1 machine-selected thumbnail
• Precision at 3 (P@3): matching the ground-truth with the top-3 machine-selected thumbnails
• Performance is also measured when using only the top-1 keyframe selected by human
annotators as the ground-truth
• Experiments are run on 10 different randomly-created splits of the used data (80% training;
20% testing), and the average performance over these runs is reported
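A small sketch of how P@k could be computed under these definitions, treating a match as a hit between frame indices; the exact frame-matching criterion is an assumption here:

```python
def precision_at_k(machine_ranking, ground_truth, k):
    # A hit if any of the top-k machine-selected thumbnails matches one of
    # the ground-truth keyframes (identity of frame indices is assumed)
    return float(any(frame in ground_truth for frame in machine_ranking[:k]))

# Toy usage: machine ranking of frame indices vs. top-3 human keyframes
machine_ranking = [42, 7, 130]
ground_truth = {7, 88, 101}
print(precision_at_k(machine_ranking, ground_truth, k=1))  # 0.0 - top-1 misses
print(precision_at_k(machine_ranking, ground_truth, k=3))  # 1.0 - frame 7 hits
```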
Experiments
Performance comparisons using the top-3 human-selected keyframes as ground-truth
                             OVP               YouTube
                             P@1      P@3      P@1      P@3
Baseline (random)            15.79%   32.51%   7.53%    17.94%
Mahasseni et al. (2017)      -        7.80%    -        11.34%
Song et al. (2016)           -        11.72%   -        16.47%
Gu et al. (2018)             -        12.18%   -        18.25%
Apostolidis et al. (2021)    15.00%   24.00%   8.75%    15.00%
Proposed approach            31.00%   40.00%   15.00%   20.00%
B. Mahasseni et al. (2017). Unsupervised Video Summarization with Adversarial LSTM Networks. Proc. CVPR 2017, pp. 2982–2991.
Y. Song et al. (2016). To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos. Proc. CIKM ’16, pp. 659–668.
H. Gu et al. (2018). From Thumbnails to Summaries - A Single Deep Neural Network to Rule Them All. Proc. ICME 2018, pp. 1–6.
E. Apostolidis et al. (2021). AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization. IEEE Trans. CSVT, vol. 31, no. 8, pp. 3278–3292, Aug. 2021.
Experiments
Performance comparisons using the top-1 human-selected keyframe as ground-truth
                             OVP               YouTube
                             P@1      P@3      P@1      P@3
Baseline (random)            6.36%    16.66%   4.23%    9.98%
Apostolidis et al. (2021)    7.00%    14.00%   6.25%    8.75%
Proposed approach            17.00%   21.00%   10.00%   16.25%
Experiments
Ablation study
(Aesthetics and Represent. estimations are used for frame picking; the Reward is used for training; all values are percentages; “top-3 GT” / “top-1 GT”: using the top-3 / top-1 human-selected keyframes as ground-truth)

                    Thumbnail selection criteria      OVP                            YouTube
                    Aesthetics   Represent.           top-3 GT       top-1 GT       top-3 GT       top-1 GT
                    estimations  estimations  Reward  P@1    P@3     P@1    P@3     P@1    P@3     P@1    P@3
Baseline (random)   -            -            -       15.79  32.51   6.36   16.66   7.53   17.94   4.23   9.98
Variant #1          √            √            X       16.00  20.00   8.00   12.00   6.00   17.50   5.00   7.50
Variant #2          X            X            √       20.00  30.00   8.00   13.00   10.00  18.75   3.75   8.75
Variant #3          √            X            √       12.00  36.00   3.00   18.00   10.00  18.75   6.25   12.50
Variant #4          X            √            √       30.00  39.00   18.00  23.00   13.75  16.25   10.00  12.50
Proposed approach   √            √            √       31.00  40.00   17.00  21.00   15.00  20.00   10.00  16.25
Conclusions
• Deep network architecture for video thumbnail selection, trained by combining adversarial
and reinforcement learning
• Thumbnail selection relies on the representativeness and aesthetic quality of video frames
• Representativeness is measured by an adversarially-trained Discriminator
• Aesthetic quality is estimated by a pretrained Fully Convolutional Network
• An overall reward is used to train the Thumbnail Selector via reinforcement learning
• Experiments on two benchmark datasets (OVP and Youtube):
• Showed the improved performance of our method compared to state-of-the-art video thumbnail
selection and summarization approaches
• Documented the importance of aesthetics for the video thumbnail selection task
Thank you for your attention!
Questions?
Evlampios Apostolidis, apostolid@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/Video-Thumbnail-Selector
This work was supported by the EU's Horizon 2020 research and innovation programme under
grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1
