GAN-based video summarization

Thessaloniki, October 2020
GAN-based Video Summarization
Vasileios Mezaris
CERTH-ITI
Presentation at the AI4Media
Workshop on GANs for Media
Content Generation
Joint work with
E. Apostolidis, E. Adamantidou,
A. Metsai (CERTH-ITI);
I. Patras (QMUL)
Video summary: a short visual summary that encapsulates the flow of the story and
the essential parts of the full-length video
Original video
Video summary (storyboard)
Problem statement
Problem statement
Applications of video summarization
 Professional CMS: effective indexing,
browsing, retrieval & promotion of media
assets
 Video sharing platforms: improved viewer
experience, enhanced viewer engagement &
increased content consumption
 Other summarization scenarios: movie trailer production, sports highlights video generation,
video synopsis of 24h surveillance recordings
Related work
Deep-learning approaches
 Various supervised methods (i.e., learning from ground-truth manually-generated summaries)
 Using feedforward neural nets (CNNs) for e.g. identifying semantically-important video parts
 Exploiting video-level metadata
 Capturing the story flow using recurrent neural nets (e.g. LSTMs)
 …and many more
 Unsupervised algorithms that do not rely on human annotations, and build summaries
 Using adversarial learning to: minimize the distance between videos and their summary-based
reconstructions; maximize the mutual information between summary and video; learn a mapping
from raw videos to human-like summaries based on online available summaries
 …and a few more approaches (see tutorial at IEEE ICME 2020,
https://www.slideshare.net/VasileiosMezaris/icme2020-tutorial-videosummarizationpart1)
+ No need for training data (limited, hard to produce)
+ Avoid the subjectivity & biases of manually-generated summaries
+ Adaptability to different types of video
GANs for unsupervised video summarization
 Our starting point: the SUM-GAN architecture [1]
 Main idea: build a keyframe selection mechanism
by minimizing the distance between the deep
representations of the original video and a
reconstructed version of it based on the selected
keyframes
 Problem: how to define a good distance?
 Solution: use a trainable discriminator network!
 Goal: train the Summarizer to maximally confuse
the Discriminator when distinguishing the original
from the reconstructed video
SUM-GAN
[1] B. Mahasseni, M. Lam, S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2982–2991.
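This idea can be written as the standard GAN min-max game; the sketch below is the generic formulation (not the paper's exact loss, which also adds reconstruction and sparsity regularization terms), with S the Summarizer producing the reconstruction x̂ and D the Discriminator operating on deep video representations φ(·):

```latex
\min_{S}\ \max_{D}\;
\mathbb{E}_{x}\big[\log D(\phi(x))\big]
+ \mathbb{E}_{x}\big[\log\big(1 - D(\phi(\hat{x}_{S}))\big)\big]
```

When D can no longer tell φ(x) from φ(x̂_S), the learned (implicit) distance between the original video and its summary-based reconstruction has been minimized.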
 Introduces two extensions [2]:
 A linear compression layer that reduces the size
of the CNN feature vectors
 An incremental and fine-grained approach to
train the model’s components
[2] E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-
based Approach for Improving the Adversarial Training in Unsupervised Video
Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production,
Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
SUM-GAN-sl
GANs for unsupervised video summarization
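The linear compression layer is simply a learned linear projection; a minimal NumPy sketch is shown below, with the dimensions taken from the implementation details later in the deck (1024-d GoogleNet pool5 features reduced to 500-d) and random stand-ins for the learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the learned parameters of the compression layer (1024 -> 500)
W = rng.normal(scale=0.01, size=(1024, 500))
b = np.zeros(500)

def compress(features):
    """Project frame-level CNN feature vectors to the lower-dimensional space."""
    return features @ W + b

frames = rng.normal(size=(60, 1024))  # e.g. 60 sampled frames of one video
compressed = compress(frames)
print(compressed.shape)               # (60, 500)
```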
 Incremental approach to train the model’s components (the training loss includes a regularization factor)
SUM-GAN-sl
GANs for unsupervised video summarization
 Adversarial learning driven by deterministic
attention auto-encoder
 The VAE in the previous architecture was replaced by an attention auto-encoder (AAE) network, forming the SUM-GAN-AAE architecture [3]
[3] E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised
Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int.
Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Jan. 2020.
SUM-GAN-AAE
GANs for unsupervised video summarization
Attention auto-encoder
Processing pipeline
 Weighted feature vectors fed to the Encoder
 Encoder’s output (V) and the Decoder’s previous hidden state fed to the Attention component
 For t > 1: use the hidden state of the previous Decoder step (ht-1)
 For t = 1: use the hidden state of the last Encoder step (hE)
 Attention weights (αt) computed using:
 Energy score function
 Soft-max function
 αt multiplied with V to form the Context Vector vt’
 vt’ combined with the Decoder’s previous output yt-1
 The Decoder gradually reconstructs the video
SUM-GAN-AAE
GANs for unsupervised video summarization
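One attention step of the pipeline above can be sketched in NumPy. The additive (Bahdanau-style) energy function is an assumption (the slides only say "energy score function"), and all weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 20, 8                      # encoder steps (frames) and hidden size (toy values)

V = rng.normal(size=(T, d))       # Encoder outputs for all steps
h_prev = rng.normal(size=d)       # Decoder's previous hidden state (hE when t = 1)

# Hypothetical additive energy score: e_t = w^T tanh(V_t Wa + h Ua)
Wa, Ua = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w = rng.normal(size=d)

energies = np.tanh(V @ Wa + h_prev @ Ua) @ w   # one energy score per encoder step
alphas = np.exp(energies - energies.max())
alphas /= alphas.sum()                         # soft-max -> attention weights αt

context = alphas @ V                           # context vector vt' (weighted sum of V)
print(context.shape)                           # (8,)
```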
Video summarization practicalities
 Input: The CNN feature vectors of the (sampled) video frames
 Output: Frame-level importance scores
 Summarization process:
 CNN features pass through the linear compression layer and the frame selector → importance scores computed at frame level
 Given a video segmentation (using Kernel Temporal Segmentation, KTS), calculate fragment-level importance scores by averaging the scores of each fragment's frames
 Summary is created by selecting the fragments that maximize the total importance score, provided that the summary length does not exceed 15% of the video duration, by solving the 0/1 Knapsack problem
Model’s I/O and summarization process
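The fragment-selection step above can be sketched as a standard 0/1 knapsack over fragment durations, where each fragment's value is the mean importance score of its frames (toy data; the KTS fragment boundaries are assumed given):

```python
def select_fragments(frame_scores, fragments, durations, budget):
    """0/1 knapsack: pick fragments maximizing total importance within a duration budget.

    frame_scores: per-frame importance scores
    fragments:    list of (start, end) frame-index ranges (e.g. from KTS)
    durations:    duration of each fragment (integer units, e.g. seconds)
    budget:       maximum total summary duration (15% of the video)
    """
    values = [sum(frame_scores[s:e]) / (e - s) for s, e in fragments]
    n, W = len(fragments), int(budget)
    dp = [[0.0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w_i = int(durations[i - 1])
        for w in range(W + 1):
            dp[i][w] = dp[i - 1][w]
            if w_i <= w:
                dp[i][w] = max(dp[i][w], dp[i - 1][w - w_i] + values[i - 1])
    # Backtrack to recover the indices of the selected fragments
    selected, w = [], W
    for i in range(n, 0, -1):
        if dp[i][w] != dp[i - 1][w]:
            selected.append(i - 1)
            w -= int(durations[i - 1])
    return sorted(selected)

scores = [0.9, 0.9, 0.1, 0.1, 0.8, 0.8, 0.2, 0.2]
frags = [(0, 2), (2, 4), (4, 6), (6, 8)]               # four 2-frame fragments
print(select_fragments(scores, frags, [2, 2, 2, 2], budget=4))  # -> [0, 2]
```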
Experiments
Datasets
 SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
 25 videos capturing multiple events (e.g. cooking and sports)
 video length: 1 to 6 min
 annotation: fragment-based video summaries
 TVSum (https://github.com/yalesong/tvsum)
 50 videos from 10 categories of the TRECVid MED task
 video length: 1 to 11 min
 annotation: frame-level importance scores
Experiments
Evaluation protocol
 The generated summary should not exceed 15% of the video length
 Similarity between the automatically generated (A) and the ground-truth (G) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring their temporal overlap (∩); |·| denotes duration
 Typical metrics for computing Precision and Recall at the frame-level
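With A and G as the sets of selected frames, the frame-level metrics are P = |A ∩ G| / |A|, R = |A ∩ G| / |G| and F = 2PR / (P + R). A small sketch (the two toy user summaries are made up for illustration):

```python
def f_score(pred, gt):
    """Frame-level F-Score between a predicted and a ground-truth summary.

    pred, gt: sets of selected frame indices; temporal overlap = intersection size.
    """
    overlap = len(pred & gt)
    if overlap == 0:
        return 0.0
    p = overlap / len(pred)   # Precision
    r = overlap / len(gt)     # Recall
    return 2 * p * r / (p + r)

auto = {0, 1, 2, 3}             # frames picked by the summarizer
users = [{0, 1, 4, 5}, {2, 3}]  # two user-annotated summaries
scores = [f_score(auto, u) for u in users]
print(max(scores), sum(scores) / len(scores))  # per-user scores combined by max or average
```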
Experiments
Evaluation protocol
 Slight but important distinction w.r.t. what is eventually used as ground-truth summary
 Most used approach in the literature: compare the generated summary against each of the N available user summaries, obtaining F-Score1, F-Score2, …, F-ScoreN
 The N per-user scores are then combined into a single score (SumMe: maximum of the N scores; TVSum: average of the N scores)
 Alternative approach: form a single ground-truth summary from the users’ annotations and compute one F-Score against it
 Videos were down-sampled to 2 fps
 Feature extraction was based on the pool5 layer of GoogleNet trained on ImageNet
 Linear compression layer reduces the size of these vectors from 1024 to 500
 All components are 2-layer LSTMs with 500 hidden units; Frame selector is a bi-directional LSTM
 Training based on the Adam optimizer; Summarizer’s learning rate = 10^-4; Discriminator’s learning rate = 10^-5
 Dataset was split into two non-overlapping sets; a training set having 80% of data and a testing
set having the remaining 20% of data
 Ran experiments on 5 differently created random splits and report the average performance at
the training-epoch-level (i.e. for the same training epoch) over these runs
Experiments
Implementation details
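The 80/20 split protocol over 5 random runs can be sketched as follows (video IDs and seed are stand-ins):

```python
import random

def make_splits(video_ids, n_splits=5, train_ratio=0.8, seed=42):
    """Create n train/test partitions, each from a fresh shuffle of the video IDs."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        ids = list(video_ids)
        rng.shuffle(ids)
        cut = int(len(ids) * train_ratio)
        splits.append((ids[:cut], ids[cut:]))  # (train 80%, test 20%), non-overlapping
    return splits

splits = make_splits(range(50))                # e.g. the 50 TVSum videos
train, test = splits[0]
print(len(train), len(test), set(train) & set(test))  # 40 10 set()
```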
 Comparison with SoA unsupervised approaches based on multiple user summaries
 Outcomes
 A few SoA methods are comparable to (or even worse than) a random summary generator
 Best method on TVSum shows random-level performance on SumMe
 Best method on SumMe performs worse than SUM-GAN-AAE and is less competitive on TVSum
 Variational attention reduces SUM-GAN-sl's performance, due to the difficulty of defining two latent spaces in parallel with the continuous updates of the model's components during training
 Replacement of VAE with AAE leads to a noticeable performance improvement over SUM-GAN-sl
Experiments
Note: SUM-GAN is not listed in this table as it follows
the single gt-summary evaluation protocol
 Evaluating the effect of the AAE component
 Training efficiency: much faster and more stable training of the model
Experiments
Loss curves for the SUM-GAN-sl and SUM-GAN-AAE
 Comparison with SoA supervised approaches based on multiple user summaries
 Outcomes
 Best methods on TVSum (MAVS and Tessellationsup, respectively) seem adapted to this dataset, as they exhibit random-level performance on SumMe
 Only a few supervised methods surpass the performance of a random summary generator on both datasets, with VASNet being the best among them
 The performance of these methods ranges between 44.1 - 49.7 on SumMe, and 56.1 - 61.4 on TVSum
 The unsupervised SUM-GAN-AAE model is comparable with SoA supervised methods
Experiments
+/- indicate better/worse performance compared to SUM-GAN-AAE
Adapting / re-purposing the content
 Main requirements:
 Target distribution platforms & devices have varying requirements (e.g. the optimal
duration of a video differs from one platform to another)
 Target audiences have different preferences / information needs
 Video summarization:
 Create editions of the content that are adapted to different platforms and audiences
Adapting / re-purposing the content
Web application [4] for video summarization (try it with your video!):
http://multimedia2.iti.gr/videosummarization/service/start.html
Demo video:
https://youtu.be/LbjPLJzeNII
[4] C. Collyda, K. Apostolidis, E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, "A
Web Service for Video Summarization", Proc. ACM Int. Conf. on Interactive Media
Experiences (IMX 2020), Barcelona, Spain, June 2020.
 Presented two new video summarization methods, making use of:
 The learning efficiency of generative adversarial networks for unsupervised training
 The effectiveness of attention mechanisms in spotting the most important parts of the video
 Experimental evaluations on two benchmarking datasets
 Documented the positive contribution of the introduced attention auto-encoder component in the
model's training and summarization performance
 Highlighted the competitiveness of the unsupervised SUM-GAN-AAE method against SoA video
summarization techniques
 Used GANs in a new web application for video summarization
 Keep in mind: complete automation is sometimes not desired! (AI + human symbiosis is key)
Conclusions
Questions?
Contact: Dr. Vasileios Mezaris
Information Technologies Institute
Centre for Research and Technology Hellas
Thermi-Thessaloniki, Greece
Tel: +30 2311 257770
Email: bmezaris@iti.gr, web: http://www.iti.gr/~bmezaris/
This work was supported in part by the EU’s Horizon 2020 research and innovation programme under grant
agreement H2020-780656 ReTV.