Combining Adversarial and Reinforcement Learning for
Video Thumbnail Selection
E. Apostolidis1,2, E. Adamantidou1, V. Mezaris1, I. Patras2
1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
2021 ACM International Conference
on Multimedia Retrieval
Outline
• Problem statement
• Related work
• Developed approach
• Experiments
• Conclusions
Problem statement
Video is everywhere!
• Captured by smart devices and instantly
shared online
• Constantly and rapidly increasing volumes
of video content on the Web
Hours of video content uploaded to YouTube every minute
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-sharing-apps-
like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
Problem statement
But how can we spot what we are looking for in endless collections of video content?
Get a quick idea about a
video’s content by
checking its thumbnail!
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
Goal of video thumbnail selection technologies
Analysis outcomes: a set of
representative video frames
“Select one or a few video frames
that provide a representative and
aesthetically-pleasing overview of
the video content”
Video title: “Susan Boyle's First Audition - I Dreamed a Dream - Britain's Got Talent 2009”
Video source: OVP dataset (video also available online at: https://www.youtube.com/watch?v=deRF9oEbRso)
Related work
Early visual-based approaches: Use of hand-crafted rules about the optimal thumbnail, and
tailored features and mechanisms to assess video frames’ alignment with these rules
• Thumbnail selection associated with: appearance and positioning of faces/objects, color diversity,
variance of luminance, scene steadiness, thematic relevance, absence of subtitles
• Main shortcoming: rule definition and feature engineering are highly complex tasks
Recent visual-based approaches: Target a few commonly-desired characteristics for a video
thumbnail, and exploit the learning efficiency of deep network architectures
• Thumbnail selection associated with: learnable estimates about frames’ representativeness and
aesthetic quality (focusing also on faces), learnable classifications of good and bad frames
Recent multimodal approaches: Exploit data from additional modalities or auxiliary sources
• Video thumbnail selection is associated with: extracted keywords from the video metadata, databases
with visually-similar content, latent representations of textual and audio data, textual user queries
Developed approach
High-level overview
Developed approach
Network architecture
• Thumbnail Selector
• Estimating frames’ aesthetic quality
• Estimating frames’ importance
• Fusing estimations and selecting a small set of candidate thumbnails (see the fusion sketch after this list)
• Thumbnail Evaluator
• Evaluating thumbnails’ aesthetic quality
• Evaluating thumbnails’ representativeness
• Fusing evaluations (rewards)
• Thumbnail Evaluator → Thumbnail Selector
• Using the overall reward for reinforcement learning
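To make the fusion step concrete, here is a minimal PyTorch sketch of how per-frame aesthetics and importance estimations could be fused and the top-scoring frames kept as candidate thumbnails. The product-based fusion, the number of candidates, and all names are illustrative assumptions, not the exact implementation of the Thumbnail Selector.

```python
import torch

def select_candidate_thumbnails(aesthetics, importance, num_candidates=3):
    # Fuse the two per-frame estimations; the product favors frames that
    # score well on BOTH criteria (an assumed fusion scheme)
    fused = aesthetics * importance
    # Keep the highest-scoring frames as candidate thumbnails
    scores, indices = torch.topk(fused, k=num_candidates)
    return indices, scores

# Toy usage: 10 frames with random per-frame scores in [0, 1]
aesthetics = torch.rand(10)
importance = torch.rand(10)
candidates, scores = select_candidate_thumbnails(aesthetics, importance)
print(candidates, scores)
```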
Developed approach
Learning objectives and pipeline
• We follow a step-wise learning approach
• Step 1: Update Encoder based on:
LRecon: “distance between original and reconstructed
feature vectors, based on a latent representation in
the last hidden layer of the Discriminator”
LPrior: “information loss when using the Encoder’s latent
space to represent the prior distribution defined by
the Variational Auto-Encoder”
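A minimal PyTorch sketch of these Step-1 losses, assuming a standard-normal VAE prior and a mean-squared feature-matching distance; all variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def encoder_loss(h_orig, h_recon, mu, log_var):
    # L_Recon: distance between the latent representations of the original
    # and the thumbnail-based reconstructed feature vectors, both taken
    # from the last hidden layer of the Discriminator
    l_recon = F.mse_loss(h_recon, h_orig)
    # L_Prior: KL divergence between the Encoder's latent Gaussian
    # N(mu, sigma^2) and an assumed standard-normal prior
    l_prior = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return l_recon + l_prior
```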
• Step 2: Update Decoder based on:
LRecon: “distance between original and reconstructed
feature vectors, based on a latent representation in the
last hidden layer of the Discriminator”
LGEN: “difference between the Discriminator’s output when
seeing the thumbnail-based reconstructed feature vectors
and the label (“1”) associated with the original video”
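A corresponding sketch of the Step-2 losses, assuming binary cross-entropy as the measure of “difference” between the Discriminator’s output and the target label:

```python
import torch
import torch.nn.functional as F

def decoder_loss(h_orig, h_recon, d_out_recon):
    # L_Recon: same feature-matching distance as in Step 1
    l_recon = F.mse_loss(h_recon, h_orig)
    # L_GEN: push the Discriminator to label the thumbnail-based
    # reconstruction as the original video ("1")
    l_gen = F.binary_cross_entropy(d_out_recon, torch.ones_like(d_out_recon))
    return l_recon + l_gen
```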
• Step 3: Update Discriminator based on:
LORIG: “difference between the Discriminator’s output when
seeing the original feature vectors and the label (“1”)
associated with the original video”
LSUM: “difference between the Discriminator’s output when
seeing the thumbnail-based reconstructed feature
vectors and the label (“0”) associated with the thumbnail-
based video summary”
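The Step-3 loss under the same binary cross-entropy assumption:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_out_orig, d_out_recon):
    # L_ORIG: original feature vectors should be classified as "1"
    l_orig = F.binary_cross_entropy(d_out_orig, torch.ones_like(d_out_orig))
    # L_SUM: thumbnail-based reconstructed feature vectors should be
    # classified as "0" (thumbnail-based video summary)
    l_sum = F.binary_cross_entropy(d_out_recon, torch.zeros_like(d_out_recon))
    return l_orig + l_sum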
• Step 4: Update Importance Estimator based
on the Episodic REINFORCE algorithm
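A minimal sketch of one such update, assuming a Bernoulli pick/skip action per frame and a moving-average reward baseline; the episode count and all names are illustrative assumptions, not the repository’s exact code:

```python
import torch
from torch.distributions import Bernoulli

def reinforce_update(frame_probs, evaluate, baseline, optimizer, episodes=5):
    # frame_probs: (num_frames,) picking probabilities from the Importance
    # Estimator; evaluate(actions) returns the Thumbnail Evaluator's overall
    # reward for a sampled selection; the baseline reduces gradient variance
    dist = Bernoulli(probs=frame_probs)
    loss = 0.0
    for _ in range(episodes):
        actions = dist.sample()                        # pick ("1") / skip ("0")
        reward = evaluate(actions)                     # scalar episode reward
        log_prob = dist.log_prob(actions).sum()
        loss = loss - (reward - baseline) * log_prob   # REINFORCE objective
    loss = loss / episodes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```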
Experiments
Datasets
• Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
• 50 videos of various genres (e.g. documentary, educational, historical, lecture)
• Video length: 46 sec. to 3.5 min.
• Annotation: keyframe-based video summaries (5 per video)
• YouTube (https://sites.google.com/site/vsummsite/download)
• 50 videos of diverse content (e.g. news, TV shows, sports, commercials) collected from the Web
• Video length: 9 sec. to 11 min.
• Annotation: keyframe-based video summaries (5 per video)
Experiments
Evaluation approach
• Ground-truth thumbnails: the top-3 keyframes selected by human annotators
• Evaluation measures (see the sketch after this list):
• Precision at 1 (P@1): matching the ground-truth with the top-1 machine-selected thumbnail
• Precision at 3 (P@3): matching the ground-truth with the top-3 machine-selected thumbnails
• Performance is also measured when using only the top-1 keyframe selected by human
annotators as the ground-truth
• Experiments are run on 10 different randomly-created splits of the used data (80% training;
20% testing), and the average performance over these runs is reported
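A small sketch of how P@k could be computed under these definitions, treating a match as a hit between frame indices; the exact frame-matching criterion is an assumption here:

```python
def precision_at_k(machine_ranking, ground_truth, k):
    # A hit if any of the top-k machine-selected thumbnails matches one of
    # the ground-truth keyframes (identity of frame indices is assumed)
    return float(any(frame in ground_truth for frame in machine_ranking[:k]))

# Toy usage: machine ranking of frame indices vs. top-3 human keyframes
machine_ranking = [42, 7, 130]
ground_truth = {7, 88, 101}
print(precision_at_k(machine_ranking, ground_truth, k=1))  # 0.0 - top-1 misses
print(precision_at_k(machine_ranking, ground_truth, k=3))  # 1.0 - frame 7 hits
```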
Experiments
Performance comparisons using the top-3 human-selected keyframes as ground-truth
                             OVP               YouTube
                             P@1      P@3      P@1      P@3
Baseline (random)            15.79%   32.51%   7.53%    17.94%
Mahasseni et al. (2017)      -        7.80%    -        11.34%
Song et al. (2016)           -        11.72%   -        16.47%
Gu et al. (2018)             -        12.18%   -        18.25%
Apostolidis et al. (2021)    15.00%   24.00%   8.75%    15.00%
Proposed approach            31.00%   40.00%   15.00%   20.00%
B. Mahasseni et al. (2017). Unsupervised Video Summarization with Adversarial LSTM Networks. Proc. CVPR 2017, pp. 2982–2991.
Y. Song et al. (2016). To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos. Proc. CIKM ’16, pp. 659–668.
H. Gu et al. (2018). From Thumbnails to Summaries - A Single Deep Neural Network to Rule Them All. Proc. ICME 2018, pp. 1–6.
E. Apostolidis et al. (2021). AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization. IEEE Trans. CSVT, vol. 31, no. 8, pp. 3278–3292, Aug. 2021.
Experiments
Performance comparisons using the top-1 human-selected keyframe as ground-truth
                             OVP               YouTube
                             P@1      P@3      P@1      P@3
Baseline (random)            6.36%    16.66%   4.23%    9.98%
Apostolidis et al. (2021)    7.00%    14.00%   6.25%    8.75%
Proposed approach            17.00%   21.00%   10.00%   16.25%
Experiments
Ablation study
(Aesthetics and Represent. estimations are used for frame picking; the Reward is used for training; all values are percentages; “top-3 GT” / “top-1 GT”: using the top-3 / top-1 human-selected keyframes as ground-truth)

                    Thumbnail selection criteria      OVP                            YouTube
                    Aesthetics   Represent.           top-3 GT       top-1 GT       top-3 GT       top-1 GT
                    estimations  estimations  Reward  P@1    P@3     P@1    P@3     P@1    P@3     P@1    P@3
Baseline (random)   -            -            -       15.79  32.51   6.36   16.66   7.53   17.94   4.23   9.98
Variant #1          √            √            X       16.00  20.00   8.00   12.00   6.00   17.50   5.00   7.50
Variant #2          X            X            √       20.00  30.00   8.00   13.00   10.00  18.75   3.75   8.75
Variant #3          √            X            √       12.00  36.00   3.00   18.00   10.00  18.75   6.25   12.50
Variant #4          X            √            √       30.00  39.00   18.00  23.00   13.75  16.25   10.00  12.50
Proposed approach   √            √            √       31.00  40.00   17.00  21.00   15.00  20.00   10.00  16.25
Conclusions
• Deep network architecture for video thumbnail selection, trained by combining adversarial
and reinforcement learning
• Thumbnail selection relies on the representativeness and aesthetic quality of video frames
• Representativeness is measured by an adversarially-trained Discriminator
• Aesthetic quality is estimated by a pretrained Fully Convolutional Network
• An overall reward is used to train the Thumbnail Selector via reinforcement learning
• Experiments on two benchmark datasets (OVP and Youtube):
• Showed the improved performance of our method compared to state-of-the-art video thumbnail
selection and summarization approaches
• Documented the importance of aesthetics for the video thumbnail selection task
Thank you for your attention!
Questions?
Evlampios Apostolidis, apostolid@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/Video-Thumbnail-Selector
This work was supported by the EU's Horizon 2020 research and innovation programme under
grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1
