"Unsupervised Video Summarization via Attention-Driven Adversarial Learning", by E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras. Proceedings of the 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Jan. 2020.
Tutorial on "Video Summarization and Re-use Technologies and Tools", delivered at IEEE ICME 2020. These slides correspond to the first part of the tutorial, presented by Vasileios Mezaris and Evlampios Apostolidis. This part deals with automatic video summarization, and includes a presentation of the video summarization problem definition and a literature overview; an in-depth discussion on a few unsupervised GAN-based methods; and a discussion on video summarization datasets, evaluation protocols and results, and future directions.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/dec-2016-member-meeting-uofw
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Professor Jeff Bilmes of the University of Washington delivers the presentation "Image and Video Summarization" at the December 2016 Embedded Vision Alliance Member Meeting. Bilmes provides an overview of the state of the art in image and video summarization.
Explaining video summarization based on the focus of attentionVasileiosMezaris
Presentation of paper "Explaining video summarization based on
the focus of attention", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at IEEE ISM 2022, Dec. 2022, Naples, Italy.
In this paper we propose a method for explaining
video summarization. We start by formulating the problem as
the creation of an explanation mask which indicates the parts
of the video that influenced the most the estimates of a video
summarization network, about the frames’ importance. Then, we
explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation
signals, and we examine various attention-based signals that have
been studied as explanations in the NLP domain. We evaluate
the performance of these signals by investigating the video
summarization network’s input-output relationship according
to different replacement functions, and utilizing measures that quantify the capability of explanations to spot the most and
least influential parts of a video. We run experiments using an
attention-based network (CA-SUM) and two datasets (SumMe
and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our
method to explain the video summarization results using clues
about the focus of the attention mechanism.
Presentation of the paper titled "Combining Global and Local Attention with Positional Encoding for Video Summarization", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at the IEEE Int. Symposium on Multimedia (ISM), Dec. 2021. The corresponding software is available at https://github.com/e-apostolidis/PGL-SUM.
In October 2017, ISO/IEC JCT1 SC29/WG11 MPEG and ITU-T SG16/Q6 VCEG have jointly published a Call for Proposals on Video Compression with Capability beyond HEVC and its current extensions. It is targeting at a new generation of video compression technology that has substantially higher compression capability than the existing HEVC standard. The responses to the call are evaluated in April 2018, forming the kick-off for a new standardization activity in the Joint Video Experts Team (JVET) of VCEG and MPEG, with a target of finalization by the end of the year 2020. Three categories of video are addressed: Standard dynamic range video (SDR), high dynamic range video (HDR), and 360° video. While SDR and HDR cover variants of conventional video to be displayed e.g. on a suitable TV screen at very high resolution (UHD), the 360° category targets at videos capturing a full-degree surround view of the scene. This enables an immersive video experience with the possibility to look around in the rendered scene, e.g. when viewed using a head-mounted display. This application triggers various technical challenges which need to be addressed in terms of compression, encoding, transport, and rendering. The talk summarizes the current state of the complete standardization project. Focussing on the SDR and 360° video categories, it highlights the development of selected coding tools compared to the state of the art. Representative examples of the new technological challenges as well as corresponding proposed solutions are presented.
Tutorial on "Video Summarization and Re-use Technologies and Tools", delivered at IEEE ICME 2020. These slides correspond to the first part of the tutorial, presented by Vasileios Mezaris and Evlampios Apostolidis. This part deals with automatic video summarization, and includes a presentation of the video summarization problem definition and a literature overview; an in-depth discussion on a few unsupervised GAN-based methods; and a discussion on video summarization datasets, evaluation protocols and results, and future directions.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/dec-2016-member-meeting-uofw
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Professor Jeff Bilmes of the University of Washington delivers the presentation "Image and Video Summarization" at the December 2016 Embedded Vision Alliance Member Meeting. Bilmes provides an overview of the state of the art in image and video summarization.
Explaining video summarization based on the focus of attentionVasileiosMezaris
Presentation of paper "Explaining video summarization based on
the focus of attention", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at IEEE ISM 2022, Dec. 2022, Naples, Italy.
In this paper we propose a method for explaining
video summarization. We start by formulating the problem as
the creation of an explanation mask which indicates the parts
of the video that influenced the most the estimates of a video
summarization network, about the frames’ importance. Then, we
explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation
signals, and we examine various attention-based signals that have
been studied as explanations in the NLP domain. We evaluate
the performance of these signals by investigating the video
summarization network’s input-output relationship according
to different replacement functions, and utilizing measures that quantify the capability of explanations to spot the most and
least influential parts of a video. We run experiments using an
attention-based network (CA-SUM) and two datasets (SumMe
and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our
method to explain the video summarization results using clues
about the focus of the attention mechanism.
Presentation of the paper titled "Combining Global and Local Attention with Positional Encoding for Video Summarization", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at the IEEE Int. Symposium on Multimedia (ISM), Dec. 2021. The corresponding software is available at https://github.com/e-apostolidis/PGL-SUM.
In October 2017, ISO/IEC JCT1 SC29/WG11 MPEG and ITU-T SG16/Q6 VCEG have jointly published a Call for Proposals on Video Compression with Capability beyond HEVC and its current extensions. It is targeting at a new generation of video compression technology that has substantially higher compression capability than the existing HEVC standard. The responses to the call are evaluated in April 2018, forming the kick-off for a new standardization activity in the Joint Video Experts Team (JVET) of VCEG and MPEG, with a target of finalization by the end of the year 2020. Three categories of video are addressed: Standard dynamic range video (SDR), high dynamic range video (HDR), and 360° video. While SDR and HDR cover variants of conventional video to be displayed e.g. on a suitable TV screen at very high resolution (UHD), the 360° category targets at videos capturing a full-degree surround view of the scene. This enables an immersive video experience with the possibility to look around in the rendered scene, e.g. when viewed using a head-mounted display. This application triggers various technical challenges which need to be addressed in terms of compression, encoding, transport, and rendering. The talk summarizes the current state of the complete standardization project. Focussing on the SDR and 360° video categories, it highlights the development of selected coding tools compared to the state of the art. Representative examples of the new technological challenges as well as corresponding proposed solutions are presented.
Unsupervised Video Summarization via Attention-Driven Adversarial Learning
1. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Unsupervised Video Summarization via Attention-Driven
Adversarial Learning
E. Apostolidis (1,2), E. Adamantidou (1), A. I. Metsai (1), V. Mezaris (1), I. Patras (2)
1 CERTH-ITI, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
26th Int. Conf. on Multimedia Modeling
Daejeon, Korea, January 2020
3. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
3
Video summary: a short visual summary that encapsulates the flow of the story and
the essential parts of the full-length video
Original video
Video summary (storyboard)
Problem statement
4. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
4
Problem statement
Applications of video summarization
Professional CMS: effective indexing,
browsing, retrieval & promotion of media
assets
Video sharing platforms: improved viewer
experience, enhanced viewer engagement &
increased content consumption
Other summarization scenarios: movie trailer production, sports highlights video generation,
video synopsis of 24h surveillance recordings
5. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
5
Related work
Deep-learning approaches
Supervised methods that use feedforward neural nets (e.g. CNNs) to extract and use video semantics for identifying important video parts, based on sequence labeling [21], self-attention networks [7], or video-level metadata [17]
Supervised approaches that capture the story flow using recurrent neural nets (e.g. LSTMs)
In combination with statistical models to select a representative and diverse set of keyframes [27]
In hierarchies to identify the video structure and select key-fragments [30, 31]
In combination with DTR units and GANs to capture long-range frame dependency [28]
To form attention-based encoder-decoders [9, 13], or memory-augmented networks [8]
Unsupervised algorithms that do not rely on human annotations, and build summaries
Using adversarial learning to: minimize the distance between videos and their summary-based
reconstructions [1, 16]; maximize the mutual information between summary and video [25]; learn a
mapping from raw videos to human-like summaries based on online available summaries [20]
Through a decision-making process that is learned via RL and reward functions [32]
By learning to extract key motions of appearing objects [29]
6. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
6
Motivation
Disadvantages of supervised learning
Restricted amount of annotated data is available for supervised training of a video
summarization method
Highly subjective nature of video summarization (depending on the viewer’s demands and aesthetics);
there is no “ideal” or commonly accepted summary that could be used for training an algorithm
Advantages of unsupervised learning
No need for labeled training data; avoids the laborious and time-demanding labeling of video data
Adaptability to different types of video; summarization is learned based on the video content
7. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
7
Contributions
Introduce an attention mechanism in an unsupervised learning framework, whereas all
previous attention-based summarization methods ([7-9, 13]) were supervised
Investigate the integration of an attention mechanism into a variational auto-encoder for video
summarization purposes
Use attention to guide the generative adversarial training of the model, rather than using it to
rank the video fragments as in [9]
8. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
Starting point: the SUM-GAN architecture
Main idea: build a keyframe selection mechanism
by minimizing the distance between the deep
representations of the original video and a
reconstructed version of it based on the selected
keyframes
Problem: how to define a good distance?
Solution: use a trainable discriminator network!
Goal: train the Summarizer to maximally confuse
the Discriminator when distinguishing the original
from the reconstructed video
8
Building on adversarial learning
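To make the adversarial objective above concrete, the following is a minimal, illustrative PyTorch sketch (module and variable names are our own placeholders, not the released implementation): a discriminator tries to tell the original frame-feature sequence apart from its summary-based reconstruction, while the summarizer/reconstructor is trained to fool it.

import torch
import torch.nn as nn

feat_dim = 500                                  # compressed frame-feature size
frames = torch.randn(1, 200, feat_dim)          # toy video: 200 sampled frames

scorer = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())   # frame importance scores
autoenc = nn.LSTM(feat_dim, feat_dim, batch_first=True)        # stands in for the auto-encoder
disc_rnn = nn.LSTM(feat_dim, 128, batch_first=True)            # discriminator backbone
disc_out = nn.Linear(128, 1)                                   # "original vs reconstructed" logit

def discriminate(x):
    _, (h, _) = disc_rnn(x)                     # last hidden state summarizes the sequence
    return disc_out(h[-1])

scores = scorer(frames)                         # (1, T, 1) importance per frame
recon, _ = autoenc(frames * scores)             # reconstruction driven by the selected frames

bce = nn.BCEWithLogitsLoss()
ones, zeros = torch.ones(1, 1), torch.zeros(1, 1)
d_loss = bce(discriminate(frames), ones) + bce(discriminate(recon.detach()), zeros)   # Discriminator: tell them apart
g_loss = bce(discriminate(recon), ones)         # Summarizer: maximally confuse the Discriminator (plus reconstruction/sparsity terms in the paper)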
9. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
SUM-GAN-sl:
Contains a linear compression layer that reduces
the size of CNN feature vectors
9
Building on adversarial learning
(Slides 10-13 incrementally build the SUM-GAN-sl training scheme in a figure: besides the linear compression layer that reduces the size of the CNN feature vectors, the model’s components are trained in an incremental and fine-grained manner, using the loss terms shown in the figure, one of which involves a regularization factor σ.)
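The exact loss formulas appear only in the figure; as one hedged example, the regularization factor σ is commonly used in SUM-GAN-style models as the target summary ratio in a length-regularization term, sketched below (an assumption about the loss form, not the released code):

import torch

def length_regularization(scores: torch.Tensor, sigma: float = 0.15) -> torch.Tensor:
    # Penalize frame-score distributions whose mean deviates from the target
    # summary ratio sigma (the "regularization factor" tuned later in the experiments).
    return torch.abs(scores.mean() - sigma)

loss_sparsity = length_regularization(torch.rand(200), sigma=0.15)   # toy frame scores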
14. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
Examined approaches:
1) Integrate an attention layer within the variational auto-encoder (VAE) of SUM-GAN-sl
2) Replace the VAE of SUM-GAN-sl with a deterministic attention auto-encoder
14
Introducing an attention mechanism
15. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
Variational attention was described in [4] and used for natural language modeling
Models the attention vector as Gaussian distributed random variables
15
1) Adversarial learning driven by a variational attention auto-encoder
Variational auto-encoder
16. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
Extended SUM-GAN-sl with variational attention, forming the SUM-GAN-VAAE architecture
The attention weights for each frame were handled as random variables and a latent space
was computed for these values, too
In every time-step t the attention component combines the encoder's output at t and the
decoder's hidden state at t - 1 to compute an attention weight vector
The decoder was modified to update its hidden states based on both latent spaces during the
reconstruction of the video
16
1) Adversarial learning driven by a variational attention auto-encoder
Variational attention
auto-encoder
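For intuition, a minimal sketch of the variational-attention idea follows (illustrative PyTorch; it shows the reparameterization mechanism applied to the attention-derived context, under our own naming, and is not the SUM-GAN-VAAE implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalAttention(nn.Module):
    # Treat the attention-derived context as a Gaussian latent variable:
    # predict a mean and log-variance, sample with the reparameterization trick,
    # and return a KL term to be added to the training loss.
    def __init__(self, dim):
        super().__init__()
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)

    def forward(self, enc_outputs, dec_hidden):           # enc_outputs: (T, dim), dec_hidden: (dim,)
        energy = enc_outputs @ dec_hidden                  # similarity score per frame, shape (T,)
        alpha = F.softmax(energy, dim=0)                   # attention weights
        context = alpha @ enc_outputs                      # deterministic context vector, shape (dim,)
        mu, logvar = self.to_mu(context), self.to_logvar(context)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return z, alpha, kl

va = VariationalAttention(500)
z, alpha, kl = va(torch.randn(200, 500), torch.randn(500))   # toy usage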
17. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
Inspired by the efficiency of the attention-based encoder-decoder network in [13]
Built on the findings of [4] w.r.t. the impact of deterministic attention on VAE
VAE was entirely replaced by an attention auto-encoder (AAE) network, forming the SUM-GAN-AAE architecture
17
2) Adversarial learning driven by deterministic attention auto-encoder
19. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
19
2) Adversarial learning driven by deterministic attention auto-encoder
Processing pipeline
Weighted feature vectors fed to the Encoder
Attention auto-encoder
20. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
20
2) Adversarial learning driven by deterministic attention auto-encoder
Processing pipeline
Weighted feature vectors fed to the Encoder
Encoder’s output (V) and Decoder’s previous hidden state fed to the Attention component
For t > 1: use the hidden state of the previous Decoder’s step (h_{t-1})
For t = 1: use the hidden state of the last Encoder’s step (h_E)
Attention auto-encoder
21. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
21
2) Adversarial learning driven by deterministic attention auto-encoder
Processing pipeline
Weighted feature vectors fed to the Encoder
Encoder’s output (V) and Decoder’s previous
hidden state fed to the Attention component
Attention weights (αt) computed using:
Attention auto-encoder
22. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Processing pipeline
Weighted feature vectors fed to the Encoder
Encoder’s output (V) and Decoder’s previous
hidden state fed to the Attention component
Attention weights (αt) computed using:
Energy score function
Soft-max function
Developed approach
22
2) Adversarial learning driven by deterministic attention auto-encoder
Attention auto-encoder
23. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Processing pipeline
Weighted feature vectors fed to the Encoder
Encoder’s output (V) and Decoder’s previous
hidden state fed to the Attention component
Attention weights (αt) computed using:
Energy score function
Soft-max function
αt multiplied with V and form Context Vector vt’
Developed approach
23
2) Adversarial learning driven by deterministic attention auto-encoder
Attention auto-encoder
24. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Processing pipeline
Weighted feature vectors fed to the Encoder
Encoder’s output (V) and Decoder’s previous
hidden state fed to the Attention component
Attention weights (αt) computed using:
Energy score function
Soft-max function
αt multiplied with V and form Context Vector vt’
vt’ combined with Decoder’s previous output yt-1
Developed approach
24
2) Adversarial learning driven by deterministic attention auto-encoder
Attention auto-encoder
25. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
25
2) Adversarial learning driven by deterministic attention auto-encoder
Attention auto-encoder
Processing pipeline
Weighted feature vectors fed to the Encoder
Encoder’s output (V) and Decoder’s previous
hidden state fed to the Attention component
Attention weights (αt) computed using:
Energy score function
Soft-max function
αt multiplied with V and form Context Vector vt’
vt’ combined with Decoder’s previous output yt-1
Decoder gradually reconstructs the video
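One step of this pipeline, written as a small illustrative PyTorch sketch (the energy-score function is one possible choice and all names are ours, so treat it as a sketch of the mechanism rather than the exact SUM-GAN-AAE code):

import torch
import torch.nn as nn
import torch.nn.functional as F

dim, T = 500, 200
V = torch.randn(T, dim)                       # Encoder outputs for the weighted frame features
h_prev = torch.randn(dim)                     # Decoder hidden state from step t-1 (h_E at t = 1)
y_prev = torch.randn(dim)                     # Decoder output from step t-1
W_energy = nn.Linear(dim, dim, bias=False)    # bilinear energy-score function (one possible choice)
decoder_cell = nn.LSTMCell(2 * dim, dim)

e = V @ W_energy(h_prev)                      # energy score for every frame
alpha = F.softmax(e, dim=0)                   # attention weights α_t
context = alpha @ V                           # context vector v'_t
step_in = torch.cat([context, y_prev])        # combine v'_t with the previous Decoder output y_{t-1}
h_t, c_t = decoder_cell(step_in.unsqueeze(0),
                        (h_prev.unsqueeze(0), torch.zeros(1, dim)))   # Decoder reconstructs frame t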
26. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Developed approach
Input: The CNN feature vectors of the (sampled) video frames
Output: Frame-level importance scores
Summarization process:
CNN features pass through the linear compression layer and the frame selector, and importance scores are computed at the frame level
Given a video segmentation (using KTS [18]), fragment-level importance scores are calculated by averaging the scores of each fragment’s frames
The summary is created by selecting the fragments that maximize the total importance score, provided that the summary length does not exceed 15% of the video duration, by solving the 0/1 Knapsack problem
26
Model’s I/O and summarization process
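A compact sketch of this fragment-selection step (KTS shot boundaries are assumed to be given as (start, end) frame indices; function and variable names are illustrative):

import numpy as np

def select_fragments(frame_scores, shot_bounds, n_frames, budget=0.15):
    # Average frame scores inside each fragment, then pick the subset of fragments
    # that maximizes total importance under the 15%-of-duration budget (0/1 knapsack).
    frag_scores = [float(np.mean(frame_scores[s:e])) for s, e in shot_bounds]
    frag_lens = [e - s for s, e in shot_bounds]
    capacity = int(budget * n_frames)

    dp = np.zeros((len(shot_bounds) + 1, capacity + 1))          # classic DP table
    for i in range(1, len(shot_bounds) + 1):
        w, v = frag_lens[i - 1], frag_scores[i - 1]
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - w] + v)

    selected, c = [], capacity                                   # backtrack the chosen fragments
    for i in range(len(shot_bounds), 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(shot_bounds[i - 1])
            c -= frag_lens[i - 1]
    return sorted(selected)

summary = select_fragments(np.random.rand(100), [(0, 25), (25, 50), (50, 75), (75, 100)], 100)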
27. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Experiments
27
Datasets
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
25 videos capturing multiple events (e.g. cooking and sports)
video length: 1.6 to 6.5 min
annotation: fragment-based video summaries
TVSum (https://github.com/yalesong/tvsum)
50 videos from 10 categories of TRECVid MED task
video length: 1 to 5 min
annotation: frame-level importance scores
28. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Experiments
28
Evaluation protocol
The generated summary should not exceed 15% of the video length
Similarity between the automatically generated (A) and the ground-truth (G) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring their temporal overlap (∩); ‖·‖ denotes duration
Typical metrics for computing Precision and Recall at the frame level
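In formulas (the standard frame-level definitions these slides rely on):

P = \frac{\|A \cap G\|}{\|A\|}, \qquad R = \frac{\|A \cap G\|}{\|G\|}, \qquad F = 2 \cdot \frac{P \cdot R}{P + R} \times 100\%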
29. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Experiments
29
Evaluation protocol
Slight but important distinction w.r.t. what is eventually used as ground-truth summary
Most used approach (by [1, 6, 7, 8, 14, 20, 21, 26, 27, 29, 30, 31, 32, 33])
(Slides 30-34 build this protocol up in a figure: an F-Score is computed between the automatically generated summary and each of the N available user summaries (F-Score1, F-Score2, ..., F-ScoreN), and these N per-user scores are then combined into a single value per video, with dataset-specific formulas for SumMe and TVSum.)
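A small sketch of that aggregation step, assuming the convention most commonly used in the literature (maximum over the user summaries for SumMe, average for TVSum); treat the per-dataset choice as our reading of the figure rather than a quote from it:

import numpy as np

def aggregate_fscores(per_user_fscores, dataset):
    # Combine the F-Scores computed against each of the N user summaries
    # into a single score per video.
    f = np.asarray(per_user_fscores, dtype=float)
    return f.max() if dataset == "SumMe" else f.mean()

print(aggregate_fscores([38.2, 51.7, 44.9], "SumMe"))   # -> 51.7
print(aggregate_fscores([55.0, 58.4, 60.1], "TVSum"))   # -> ~57.8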
35. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Experiments
35
Evaluation protocol
Slight but important distinction w.r.t. what is eventually used as ground-truth summary
Alternative approach (used in [9, 13, 16, 24, 25, 28])
(Slide 36 adds the corresponding figure: a single F-Score is computed against the single ground-truth summary of each video.)
37. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Videos were down-sampled to 2 fps
Feature extraction was based on the pool5 layer of GoogleNet trained on ImageNet
Linear compression layer reduces the size of these vectors from 1024 to 500
All components are 2-layer LSTMs with 500 hidden units; Frame selector is a bi-directional LSTM
Training based on the Adam optimizer; Summarizer’s learning rate = 10^-4; Discriminator’s learning rate = 10^-5
Dataset was split into two non-overlapping sets; a training set having 80% of data and a testing
set having the remaining 20% of data
Ran experiments on 5 differently created random splits and report the average performance at
the training-epoch-level (i.e. for the same training epoch) over these runs
Experiments
37
Implementation details
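The same implementation details, gathered into a small config sketch (key names are our own; the authoritative values are in the released code at https://github.com/e-apostolidis/SUM-GAN-AAE):

config = {
    "frame_sampling_fps": 2,            # videos down-sampled to 2 fps
    "features": "GoogleNet pool5",      # ImageNet-trained, 1024-d frame descriptors
    "compressed_feat_dim": 500,         # linear compression layer: 1024 -> 500
    "lstm_layers": 2,                   # all components are 2-layer LSTMs
    "lstm_hidden_units": 500,           # frame selector is a bi-directional LSTM
    "optimizer": "Adam",
    "lr_summarizer": 1e-4,
    "lr_discriminator": 1e-5,
    "train_test_split": (0.8, 0.2),     # non-overlapping 80% / 20% split
    "num_random_splits": 5,             # performance averaged per training epoch over 5 splits
}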
39. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Step 1: Assessing the impact of regularization factor σ
Outcomes:
Value of σ affects the models’ performance and needs fine-tuning
Fine-tuning is dataset-dependent
The best overall performance of each model is observed for a different σ value
Experiments
39
SUM-GAN-sl SUM-GAN-VAAE SUM-GAN-AAE
40. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Step 2: Comparison with SoA unsupervised approaches based on multiple user summaries
Outcomes
A few SoA methods are comparable to (or even worse than) a random summary generator
Best method on TVSum shows random-level performance on SumMe
Best method on SumMe performs worse than SUM-GAN-AAE and is less competitive on TVSum
Variational attention reduces the efficiency of SUM-GAN-sl, due to the difficulty of learning two latent spaces in parallel with the continuous updates of the model’s components during training
Replacement of VAE with AAE leads to a noticeable performance improvement over SUM-GAN-sl
Experiments
40
+/- indicate better/worse performance
compared to SUM-GAN-AAE
Note: SUM-GAN is not listed in this table as it follows
the single gt-summary evaluation protocol
41. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Step 3: Evaluating the effect of the introduced AAE component
Key-fragment selection: Attention mechanism leads to much smoother series of importance
scores
Experiments
41
42. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Step 3: Evaluating the effect of the introduced AAE component
Training efficiency: much faster and more stable training of the model
Experiments
42
Loss curves for the SUM-GAN-sl and SUM-GAN-AAE
43. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Step 4: Comparison with SoA supervised approaches based on multiple user summaries
Outcomes
Best methods on TVSum (MAVS and Tessellation-sup, respectively) seem adapted to this dataset, as
they exhibit random-level performance on SumMe
Only a few supervised methods surpass the performance of a random summary generator on both
datasets, with VASNet being the best among them
The performance of these methods ranges between 44.1 - 49.7 on SumMe, and 56.1 - 61.4 on TVSum
The unsupervised SUM-GAN-AAE model is comparable with SoA supervised methods
Experiments
43
44. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Step 5: Comparison with SoA approaches based on single ground-truth summaries
Impact of regularization factor σ (best scores in bold)
The model’s performance is affected by the value of σ
The effect of σ depends (also) on the evaluation approach; best performance when using multiple
human summaries was observed for σ = 0.15
SUM-GAN-AAE outperforms the original SUM-GAN model on both datasets, even for the same value
of σ
Experiments
44
45. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Step 5: Comparison with SoA approaches based on single ground-truth summaries
Outcomes
SUM-GAN-AAE model performs consistently well on both datasets
SUM-GAN-AAE shows advanced performance compared to SoA supervised and unsupervised (*)
summarization methods
Experiments
45
Unsupervised approaches
marked with an asterisk
47. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Presented a video summarization method that combines:
The effectiveness of attention mechanisms in spotting the most important parts of the video
The learning efficiency of the generative adversarial networks for unsupervised training
Experimental evaluations on two benchmarking datasets:
Documented the positive contribution of the introduced attention auto-encoder component in the
model's training and summarization performance
Highlighted the competitiveness of the unsupervised SUM-GAN-AAE method against SoA video
summarization techniques
Conclusions
47
48. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
1. E. Apostolidis, et al.: A stepwise, label-based approach for improving the adversarial training in unsupervised video summarization. In:
AI4TV, ACM MM 2019
2. E. Apostolidis, et al.: Fast shot segmentation combining global and local visual descriptors. In: IEEE ICASSP 2014. pp. 6583-6587
3. K. Apostolidis, et al.: A motion-driven approach for fine-grained temporal segmentation of user-generated videos. In: MMM 2018. pp. 29-41
4. H. Bahuleyan, et al.: Variational attention for sequence-to-sequence models. In: 27th COLING. pp. 1672-1682 (2018)
5. J. Cho: PyTorch implementation of SUM-GAN (2017), https://github.com/j-min/Adversarial_Video_Summary (last accessed on Oct. 18, 2019)
6. M. Elfeki, et al.: Video summarization via actionness ranking. In: IEEE WACV 2019. pp. 754-763
7. J. Fajtl, et al.: Summarizing videos with attention. In: ACCV 2018. pp. 39-54
8. L. Feng, et al.: Extractive video summarizer with memory augmented neural networks. In: ACM MM 2018. pp. 976-983
9. T. Fu, et al.: Attentive and adversarial learning for video summarization. In: IEEE WACV 2019. pp. 1579-1587
10. M. Gygli, et al.: Creating summaries from user videos. In: ECCV 2014. pp. 505-520
11. M. Gygli, et al.: Video summarization by learning submodular mixtures of objectives. In: IEEE CVPR 2015. pp. 3090-3098
12. S. Hochreiter, et al.: Long Short-Term Memory. Neural Computation 9(8), 1735-1780 (1997)
13. Z. Ji, et al.: Video summarization with attention-based encoder-decoder networks. IEEE Trans. on Circuits and Systems for Video
Technology (2019)
14. D. Kaufman, et al.: Temporal Tessellation: A unified approach for video analysis. In: IEEE ICCV 2017. pp. 94-104
15. S. Lee, et al.: A memory network approach for story-based temporal summarization of 360 videos. In: IEEE CVPR 2018. pp. 1410-1419
16. B. Mahasseni, et al.: Unsupervised video summarization with adversarial LSTM networks. In: IEEE CVPR 2017. pp. 2982-2991
17. M. Otani, et al.: Video summarization using deep semantic features. In: ACCV 2016. pp. 361-377
Key references
48
49. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
18. D. Potapov, et al.: Category-specific video summarization. In: ECCV 2014. pp. 540-555
19. A. Radford, et al.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR 2016
20. M. Rochan, et al.: Video summarization by learning from unpaired data. In: IEEE CVPR 2019
21. M. Rochan, et al.: Video summarization using fully convolutional sequence networks. In: ECCV 2018. pp. 358-374
22. Y. Song, et al.: TVSum: Summarizing web videos using titles. In: IEEE CVPR 2015. pp. 5179-5187
23. C. Szegedy, et al.: Going deeper with convolutions. In: IEEE CVPR 2015. pp. 1-9
24. H. Wei, et al.: Video summarization via semantic attended networks. In: AAAI 2018. pp. 216-223
25. L. Yuan, et al.: Cycle-SUM: Cycle-consistent adversarial LSTM networks for unsupervised video summarization. In: AAAI 2019. pp. 9143-
9150
26. Y. Yuan, et al.: Video summarization by learning deep side semantic embedding. IEEE Trans. on Circuits and Systems for Video Technology
29(1), 226-237 (2019)
27. K. Zhang, et al.: Video summarization with Long Short-Term Memory. In: ECCV 2016. pp. 766-782
28. Y. Zhang, et al.: DTR-GAN: Dilated temporal relational adversarial network for video summarization. In: ACM TURC 2019. pp. 89:1-89:6
29. Y. Zhang, et al.: Unsupervised object-level video summarization with online motion auto-encoder. Pattern Recognition Letters (2018)
30. B. Zhao, et al.: Hierarchical recurrent neural network for video summarization. In: ACM MM 2017. pp. 863-871
31. B. Zhao, et al.: HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. In: IEEE/CVF CVPR 2018. pp. 7405-7414
32. K. Zhou, et al.: Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: AAAI
2018. pp. 7582-7589
33. K. Zhou, et al.: Video summarisation by classification with deep reinforcement learning. In: BMVC 2018
Key references
49
50. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
50
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/SUM-GAN-AAE
This work was supported by the EU’s Horizon 2020 research and innovation
programme under grant agreement H2020-780656 ReTV. The work of Ioannis
Patras has been supported by EPSRC under grant No. EP/R026424/1.