Video Thumbnail Selector

Combining Adversarial and Reinforcement Learning for
Video Thumbnail Selection
E. Apostolidis (1,2), E. Adamantidou (1), V. Mezaris (1), I. Patras (2)
(1) Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
(2) School of EECS, Queen Mary University of London, London, UK
2021 ACM International Conference
on Multimedia Retrieval

Outline
• Problem statement
• Related work
• Developed approach
• Experiments
• Conclusions

Problem statement
Video is everywhere!
• Captured by smart devices and instantly
shared online
• Constantly and rapidly increasing volumes
of video content on the Web
Hours of video content uploaded on
YouTube every minute
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-sharing-apps-like-tiktok/1767354/ (left) & https://www.statista.com/ (right)

Problem statement
But how to spot what we are looking for in endless collections of video content?
Get a quick idea about a
video’s content by
checking its thumbnail!
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/

Goal of video thumbnail selection technologies
Analysis outcomes: a set of
representative video frames
“Select one or a few video frames
that provide a representative and
aesthetically-pleasing overview of
the video content”
Video title: “Susan Boyle's First Audition - I Dreamed a Dream - Britain's Got Talent 2009”
Video source: OVP dataset (video also available online at: https://www.youtube.com/watch?v=deRF9oEbRso)

Related work
Early visual-based approaches: Use of hand-crafted rules about the optimal thumbnail, and
tailored features and mechanisms to assess video frames’ alignment with these rules
• Thumbnail selection associated with: appearance and positioning of faces/objects, color diversity,
variance of luminance, scene steadiness, thematic relevance, absence of subtitles
• Main shortcoming: rule definition and feature engineering are highly complex tasks
Recent visual-based approaches: Target a few commonly-desired characteristics for a video
thumbnail, and exploit learning efficiency of deep network architectures
• Thumbnail selection associated with: learnable estimates about frames’ representativeness and
aesthetic quality (focusing also on faces), learnable classifications of good and bad frames
Recent multimodal approaches: Exploit data from additional modalities or auxiliary sources
• Thumbnail selection associated with: keywords extracted from the video metadata, databases
with visually-similar content, latent representations of textual and audio data, textual user queries

Developed approach
High-level overview

Developed approach
Network architecture
• Thumbnail Selector
  • Estimating frames’ aesthetic quality
  • Estimating frames’ importance
  • Fusing estimations and selecting a small set of
    candidate thumbnails
• Thumbnail Evaluator
  • Evaluating thumbnails’ aesthetic quality
  • Evaluating thumbnails’ representativeness
  • Fusing evaluations (rewards)
• Thumbnail Evaluator → Thumbnail Selector
  • Using the overall reward for reinforcement learning
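
The fusion and selection steps listed above can be illustrated with a minimal Python sketch. The slide does not state how the estimations or the rewards are fused, so the element-wise product of scores and the plain average of rewards used here are assumptions, and the names `select_thumbnails` and `overall_reward` are hypothetical.

```python
def select_thumbnails(importance, aesthetics, k=3):
    """Fuse per-frame scores and return the indices of the k best frames.

    ASSUMPTION: fusion is an element-wise product; the slide only says
    that the two estimations are fused.
    """
    fused = [i * a for i, a in zip(importance, aesthetics)]
    ranked = sorted(range(len(fused)), key=lambda t: fused[t], reverse=True)
    return ranked[:k], fused


def overall_reward(aesthetic_reward, representativeness_reward):
    """Combine the two Thumbnail Evaluator rewards into one scalar.

    ASSUMPTION: a plain average; the actual fusion rule is not given.
    """
    return 0.5 * (aesthetic_reward + representativeness_reward)


if __name__ == "__main__":
    importance = [0.2, 0.9, 0.5, 0.7]   # toy Importance Estimator outputs
    aesthetics = [0.9, 0.1, 0.8, 0.6]   # toy aesthetic-quality scores
    picks, fused = select_thumbnails(importance, aesthetics, k=2)
    print(picks, overall_reward(0.8, 0.6))
```

In this toy example, frame 1 has the highest importance but a low aesthetic score, so it is not among the selected candidates once the two estimations are fused.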

Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 1: Update Encoder based on:
L_Recon: “distance between original and reconstructed
feature vectors, based on a latent representation in
the last hidden layer of the Discriminator”
L_Prior: “information loss when using Encoder’s latent
space to represent the prior distribution defined by
the Variational Auto-Encoder”
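
A minimal Python sketch of the two Encoder losses follows. It assumes that the "distance" is a squared L2 distance between Discriminator hidden-layer vectors, and that the "information loss" is the standard VAE KL divergence between a diagonal Gaussian and a standard normal prior; neither choice is confirmed by the slide. The function names are hypothetical.

```python
import math


def reconstruction_loss(h_original, h_reconstructed):
    """Squared L2 distance between two hidden-layer vectors.

    ASSUMPTION: both inputs are the Discriminator's last-hidden-layer
    representations of the original and the reconstructed features.
    """
    return sum((a - b) ** 2 for a, b in zip(h_original, h_reconstructed))


def prior_loss(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the usual VAE prior term.

    ASSUMPTION: the Encoder outputs a mean and a log-variance per latent
    dimension, and the prior is a standard normal distribution.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


if __name__ == "__main__":
    print(reconstruction_loss([1.0, 2.0], [1.0, 4.0]))
    print(prior_loss([0.0, 0.0], [0.0, 0.0]))
```

The prior term vanishes exactly when the Encoder's output distribution equals the prior, and grows as the two diverge.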

Developed approach
• Step 2: Update Decoder based on:
L_Recon: “distance between original and reconstructed
feature vectors, based on a latent representation in the
last hidden layer of the Discriminator”
L_GEN: “difference between Discriminator’s output when
seeing the thumbnail-based reconstructed feature vectors
and the label (“1”) associated with the original video”

Developed approach
• Step 3: Update Discriminator based on:
L_ORIG: “difference between Discriminator’s output when
seeing the original feature vectors and the label (“1”)
associated with the original video”
L_SUM: “difference between Discriminator’s output when
seeing the thumbnail-based reconstructed feature
vectors and the label (“0”) associated with the
thumbnail-based video summary”
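
The label-based losses of Steps 2 and 3 can be sketched as follows. The slides only speak of a "difference" between the Discriminator's output and a label, so the squared difference used here is an assumption, and the function names are hypothetical.

```python
def label_loss(prob, label):
    """Squared difference between a Discriminator output and a target label.

    ASSUMPTION: the "difference" is a squared error; the slides do not
    name the exact distance.
    """
    return (prob - label) ** 2


def generator_loss(p_reconstructed):
    """L_GEN: the Decoder wants reconstructions to be labelled "1" (original)."""
    return label_loss(p_reconstructed, 1.0)


def discriminator_losses(p_original, p_reconstructed):
    """(L_ORIG, L_SUM): originals should score 1, reconstructions should score 0."""
    return label_loss(p_original, 1.0), label_loss(p_reconstructed, 0.0)


if __name__ == "__main__":
    # A Discriminator that is unsure about the reconstruction (output 0.5).
    print(generator_loss(0.5), discriminator_losses(0.9, 0.5))
```

The two sides pull in opposite directions on the same Discriminator output: L_GEN is minimized when a reconstruction scores 1, while L_SUM is minimized when it scores 0.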

Developed approach
• Step 4: Update Importance Estimator based
on the Episodic REINFORCE algorithm
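
A minimal sketch of an episodic REINFORCE update is given below. It assumes a softmax policy over per-frame logits, one sampled frame per episode, and the mean episode reward as a baseline; the Importance Estimator in the presented work is a deep network, so this toy version only illustrates the update rule. All names (`reinforce_gradient`, `reinforce_step`, `toy_rewards`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, picks, rewards):
    """Policy-gradient estimate averaged over several sampled episodes.

    ASSUMPTION: the baseline is the mean reward of the sampled episodes.
    """
    probs = softmax(logits)
    baseline = sum(rewards) / len(rewards)
    grad = [0.0] * len(logits)
    for pick, reward in zip(picks, rewards):
        advantage = reward - baseline
        for j in range(len(logits)):
            # d log softmax(pick) / d logit_j = 1[j == pick] - probs[j]
            indicator = 1.0 if j == pick else 0.0
            grad[j] += advantage * (indicator - probs[j]) / len(picks)
    return grad


def reinforce_step(logits, reward_fn, rng, episodes=5, lr=0.5):
    """Sample episodes from the current policy and take one ascent step."""
    probs = softmax(logits)
    frames = range(len(logits))
    picks = [rng.choices(frames, weights=probs)[0] for _ in range(episodes)]
    rewards = [reward_fn(p) for p in picks]
    grad = reinforce_gradient(logits, picks, rewards)
    return [l + lr * g for l, g in zip(logits, grad)]


if __name__ == "__main__":
    toy_rewards = [0.1, 0.9, 0.3, 0.2]  # hypothetical overall reward per frame
    rng = random.Random(0)
    logits = [0.0] * len(toy_rewards)
    for _ in range(200):
        logits = reinforce_step(logits, lambda f: toy_rewards[f], rng)
    print([round(p, 3) for p in softmax(logits)])
```

Frames that earn more than the baseline have their selection probability raised, and frames that earn less have it lowered, which is how the overall reward from the Thumbnail Evaluator can steer the frame-scoring component.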

Experiments
Datasets
• Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
  • 50 videos of various genres (e.g. documentary, educational, historical, lecture)
  • Video length: 46 sec. to 3.5 min.
  • Annotation: keyframe-based video summaries (5 per video)
• Youtube (https://sites.google.com/site/vsummsite/download)
  • 50 videos of diverse content (e.g. news, TV-shows, sports, commercials) collected from the Web
  • Video length: 9 sec. to 11 min.
  • Annotation: keyframe-based video summaries (5 per video)

Experiments
Evaluation approach
• Ground-truth thumbnails: the top-3 keyframes selected by human annotators
• Evaluation measures:
  • Precision at 1 (P@1): match between the ground truth and the top-1 machine-selected thumbnail
  • Precision at 3 (P@3): match between the ground truth and the top-3 machine-selected thumbnails
• Performance is also measured using only the top-1 keyframe selected by human
  annotators as the ground truth
• Experiments are run on 10 different randomly-created splits of the data (80% training;
  20% testing), and the average performance over these runs is reported
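
The evaluation protocol above can be sketched in a few lines of Python. The sketch assumes that a video counts as a hit when any of its top-k machine-selected frames is among the ground-truth thumbnails, and that matching means identical frame indices; both are one possible reading of the slide, not a confirmed definition. The helper names are hypothetical.

```python
import random


def precision_at_k(predictions, ground_truths, k):
    """Percentage of videos with at least one ground-truth hit in the top k.

    ASSUMPTION: a hit is an exact frame-index match; a real evaluation
    might match frames by visual similarity instead.
    """
    hits = sum(
        any(frame in truth for frame in ranked[:k])
        for ranked, truth in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(predictions)


def make_splits(video_ids, n_splits=10, train_ratio=0.8, seed=0):
    """Create random train/test splits (80%/20% by default)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        shuffled = list(video_ids)
        rng.shuffle(shuffled)
        cut = int(round(train_ratio * len(shuffled)))
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits


if __name__ == "__main__":
    predictions = [[3, 7, 9], [1, 2, 4]]   # ranked frame indices per video
    ground_truths = [{7, 20}, {5, 6}]      # human-selected frame indices
    print(precision_at_k(predictions, ground_truths, 1))
    print(precision_at_k(predictions, ground_truths, 3))
    train, test = make_splits(range(50))[0]
    print(len(train), len(test))
```

Under this reading, P@3 can never be lower than P@1 for the same set of predictions, because the top-3 list always contains the top-1 frame.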

Experiments
Performance comparisons using top-3 human selected keyframes as ground-truth

                            OVP               Youtube
                            P@1      P@3      P@1      P@3
Baseline (random)           15.79%   32.51%   7.53%    17.94%
Mahasseni et al. (2017)     -        7.80%    -        11.34%
Song et al. (2016)          -        11.72%   -        16.47%
Gu et al. (2018)            -        12.18%   -        18.25%
Apostolidis et al. (2021)   15.00%   24.00%   8.75%    15.00%
Proposed approach           31.00%   40.00%   15.00%   20.00%

B. Mahasseni et al. (2017). Unsupervised Video Summarization with Adversarial LSTM Networks. Proc. CVPR 2017, pp. 2982–2991.
Y. Song et al. (2016). To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos. Proc. CIKM ’16, pp. 659–668.
H. Gu et al. (2018). From Thumbnails to Summaries - A Single Deep Neural Network to Rule Them All. Proc. ICME 2018, pp. 1–6.
E. Apostolidis et al. (2021). AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization. IEEE Trans. CSVT, vol. 31, no. 8, pp. 3278–3292, Aug. 2021.

Experiments
Performance comparisons using top-1 human selected keyframes as ground-truth

                            OVP               Youtube
                            P@1      P@3      P@1      P@3
Baseline (random)           6.36%    16.66%   4.23%    9.98%
Apostolidis et al. (2021)   7.00%    14.00%   6.25%    8.75%
Proposed approach           17.00%   21.00%   10.00%   16.25%

Experiments
Ablation study
Thumbnail selection criteria (√ = used, X = not used) and performance (values in %)
“top-3 human” / “top-1 human”: using the top-3 / top-1 human selections as ground truth

                    Aesthetics estimations  Represent.   OVP                         Youtube
                    Frame picking Reward    estimations  top-3 human   top-1 human   top-3 human   top-1 human
                                                         P@1    P@3    P@1    P@3    P@1    P@3    P@1    P@3
Baseline (random)   -             -         -            15.79  32.51  6.36   16.66  7.53   17.94  4.23   9.98
Variant #1          √             √         X            16.00  20.00  8.00   12.00  6.00   17.50  5.00   7.50
Variant #2          X             X         √            20.00  30.00  8.00   13.00  10.00  18.75  3.75   8.75
Variant #3          √             X         √            12.00  36.00  3.00   18.00  10.00  18.75  6.25   12.50
Variant #4          X             √         √            30.00  39.00  18.00  23.00  13.75  16.25  10.00  12.50
Proposed approach   √             √         √            31.00  40.00  17.00  21.00  15.00  20.00  10.00  16.25

Conclusions
• Deep network architecture for video thumbnail selection, trained by combining adversarial
  and reinforcement learning
• Thumbnail selection relies on representativeness and aesthetic quality of video frames
  • Representativeness measured by an adversarially-trained discriminator
  • Aesthetic quality estimated by a pretrained Fully Convolutional Network
  • An overall reward is used to train the Thumbnail Selector via reinforcement learning
• Experiments on two benchmark datasets (OVP and Youtube):
  • Showed that our method outperforms other state-of-the-art (SoA) video thumbnail selection
    and summarization approaches
  • Documented the importance of aesthetics for the video thumbnail selection task

Thank you for your attention!
Questions?
Evlampios Apostolidis, apostolid@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/Video-Thumbnail-Selector
This work was supported by the EU’s Horizon 2020 research and innovation programme under
grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1
