
Unsupervised Video Summarization via Attention-Driven Adversarial Learning

"Unsupervised Video Summarization via Attention-Driven Adversarial Learning", by E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras. Proceedings of the 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Jan. 2020.


  1. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
     Unsupervised Video Summarization via Attention-Driven Adversarial Learning
     E. Apostolidis1,2, E. Adamantidou1, A. I. Metsai1, V. Mezaris1, I. Patras2
     1 CERTH-ITI, Thermi - Thessaloniki, Greece; 2 School of EECS, Queen Mary University of London, London, UK
     26th Int. Conf. on Multimedia Modeling, Daejeon, Korea, January 2020
  2. Outline
      Introduction
      Motivation
      Developed approach
      Experiments
      Summarization example
      Conclusions
  3. Problem statement
     Video summary: a short visual summary that encapsulates the flow of the story and the essential parts of the full-length video (figure: an original video and its storyboard-type summary)
  4. Problem statement: applications of video summarization
      Professional CMS: effective indexing, browsing, retrieval & promotion of media assets
      Video sharing platforms: improved viewer experience, enhanced viewer engagement & increased content consumption
      Other summarization scenarios: movie trailer production, sports highlights video generation, video synopsis of 24h surveillance recordings
  5. Related work: deep-learning approaches
      Supervised methods that use feedforward neural nets (e.g. CNNs) and extract video semantics to identify the important video parts, based on sequence labeling [21], self-attention networks [7], or video-level metadata [17]
      Supervised approaches that capture the story flow using recurrent neural nets (e.g. LSTMs):
        in combination with statistical models, to select a representative and diverse set of keyframes [27]
        in hierarchies, to identify the video structure and select key-fragments [30, 31]
        in combination with DTR units and GANs, to capture long-range frame dependencies [28]
        to form attention-based encoder-decoders [9, 13] or memory-augmented networks [8]
      Unsupervised algorithms that do not rely on human annotations and build summaries:
        using adversarial learning to minimize the distance between videos and their summary-based reconstructions [1, 16], maximize the mutual information between summary and video [25], or learn a mapping from raw videos to human-like summaries based on summaries available online [20]
        through a decision-making process that is learned via RL and reward functions [32]
        by learning to extract the key motions of appearing objects [29]
  6. Motivation
     Disadvantages of supervised learning:
      Only a restricted amount of annotated data is available for supervised training of a video summarization method
      Video summarization is highly subjective (it relies on the viewer’s demands and aesthetics); there is no “ideal” or commonly accepted summary that could be used for training an algorithm
     Advantages of unsupervised learning:
      No need for ground-truth training data; avoids the laborious and time-demanding labeling of video data
      Adaptability to different types of video; summarization is learned from the video content itself
  7. Contributions
      Introduce an attention mechanism in an unsupervised learning framework, whereas all previous attention-based summarization methods ([7-9, 13]) were supervised
      Investigate the integration of an attention mechanism into a variational auto-encoder for video summarization purposes
      Use attention to guide the generative adversarial training of the model, rather than using it to rank the video fragments as in [9]
  8. Developed approach: building on adversarial learning
      Starting point: the SUM-GAN architecture
      Main idea: build a keyframe selection mechanism by minimizing the distance between the deep representations of the original video and a reconstructed version of it based on the selected keyframes
      Problem: how to define a good distance?
      Solution: use a trainable discriminator network!
      Goal: train the Summarizer to maximally confuse the Discriminator when distinguishing the original from the reconstructed video
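The learned-distance idea above can be sketched numerically. This is a toy illustration and not the authors' implementation: the discriminator here is just a fixed random projection followed by a sigmoid, and all shapes, names, and values are invented for the example.

```python
import numpy as np

def discriminator_score(x, w):
    """Toy discriminator: mean-pool the frame sequence, project, sigmoid."""
    z = x.mean(axis=0) @ w           # pool over frames, then project
    return 1.0 / (1.0 + np.exp(-z))  # probability of "original video"

rng = np.random.default_rng(0)
T, F = 20, 8                         # frames x feature size (toy values)
original = rng.normal(size=(T, F))   # stand-in for the original video features
reconstruction = original + 0.1 * rng.normal(size=(T, F))  # summary-based recon
w = rng.normal(size=F)

d_orig = discriminator_score(original, w)
d_recon = discriminator_score(reconstruction, w)

# Discriminator objective: classify the original as 1, the reconstruction as 0
loss_d = -np.log(d_orig) - np.log(1.0 - d_recon)
# Summarizer (generator) objective: fool the discriminator on the reconstruction
loss_g = -np.log(d_recon)
```

Minimizing `loss_g` pushes the reconstruction towards being indistinguishable from the original, which is exactly the "trainable distance" role the slide assigns to the discriminator.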
  9. Developed approach: building on adversarial learning
      SUM-GAN-sl:
        contains a linear compression layer that reduces the size of the CNN feature vectors
        follows an incremental and fine-grained approach to train the model’s components (σ: regularization factor)
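The linear compression step can be sketched as follows; the 1024 to 500 reduction matches the implementation details given later in the deck, while the weights and frame count here are random placeholders.

```python
import numpy as np

# Sketch of the linear compression layer: per-frame CNN feature vectors
# (1024-d GoogLeNet pool5 features, per the implementation details) are
# reduced to 500-d by a learnable linear map. Random placeholder weights.

rng = np.random.default_rng(1)
T = 30                                   # number of sampled frames (toy value)
features = rng.normal(size=(T, 1024))    # CNN features, one row per frame
W = rng.normal(size=(1024, 500)) * 0.01  # compression weights (learnable)
b = np.zeros(500)                        # bias (learnable)

compressed = features @ W + b            # shape (T, 500)
```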
  14. Developed approach: introducing an attention mechanism
      Examined approaches:
       1) Integrate an attention layer within the variational auto-encoder (VAE) of SUM-GAN-sl
       2) Replace the VAE of SUM-GAN-sl with a deterministic attention auto-encoder
  15. Developed approach: 1) adversarial learning driven by a variational attention auto-encoder
      Variational attention was described in [4] and used for natural language modeling
      It models the attention vector as Gaussian-distributed random variables
  16. Developed approach: 1) adversarial learning driven by a variational attention auto-encoder
      SUM-GAN-sl was extended with variational attention, forming the SUM-GAN-VAAE architecture
      The attention weights for each frame are handled as random variables, and a latent space is computed for these values too
      At every time-step t, the attention component combines the encoder’s output at t and the decoder’s hidden state at t-1 to compute an attention weight vector
      The decoder was modified to update its hidden states based on both latent spaces during the reconstruction of the video
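The variational-attention idea from [4] can be sketched in a few lines: instead of deterministic attention weights, a mean and log-variance are predicted per position and the attention vector is sampled via the reparameterization trick, with a KL penalty against a standard normal prior. All names and shapes here are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 10                                # sequence length (toy value)
mu = rng.normal(size=T)               # predicted mean of the attention logits
log_var = rng.normal(size=T) * 0.1    # predicted log-variance

# Reparameterization trick: sample = mu + sigma * eps, eps ~ N(0, I)
eps = rng.normal(size=T)
sampled = mu + np.exp(0.5 * log_var) * eps

# Normalize the sampled logits into attention weights
weights = np.exp(sampled - sampled.max())
weights /= weights.sum()

# KL divergence of N(mu, sigma^2) from N(0, 1), as in a standard VAE
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```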
  17. Developed approach: 2) adversarial learning driven by a deterministic attention auto-encoder
      Inspired by the efficiency of the attention-based encoder-decoder network in [13]
      Built on the findings of [4] w.r.t. the impact of deterministic attention on the VAE
      The VAE was entirely replaced by an attention auto-encoder (AAE) network, forming the SUM-GAN-AAE architecture
  18. Developed approach: 2) adversarial learning driven by a deterministic attention auto-encoder
      Attention auto-encoder processing pipeline:
        Weighted feature vectors are fed to the Encoder
        The Encoder’s output (V) and the Decoder’s previous hidden state are fed to the Attention component (for t > 1 this is the hidden state of the Decoder’s previous step, h_{t-1}; for t = 1, the hidden state of the Encoder’s last step, h_e)
        Attention weights (α_t) are computed using an energy score function and a soft-max function
        α_t is multiplied with V to form the context vector v_t’
        v_t’ is combined with the Decoder’s previous output y_{t-1}
        The Decoder gradually reconstructs the video
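One decoding step of this pipeline (energy scores, soft-max, context vector) can be sketched as follows. The additive energy function used here is an assumption for illustration; the paper's exact scoring function may differ, and all shapes and weights are toy values.

```python
import numpy as np

rng = np.random.default_rng(3)
T, H = 12, 16                        # frames, hidden size (toy values)
V = rng.normal(size=(T, H))          # encoder outputs for all time-steps
h_prev = rng.normal(size=H)          # decoder hidden state at t-1

# Learnable parameters of an additive (Bahdanau-style) energy function
Wa = rng.normal(size=(H, H)) * 0.1
Ua = rng.normal(size=(H, H)) * 0.1
va = rng.normal(size=H)

# Energy score for every encoder position, given the decoder state
energies = np.tanh(V @ Wa + h_prev @ Ua) @ va        # shape (T,)

# Soft-max turns the energies into attention weights a_t
a_t = np.exp(energies - energies.max())
a_t /= a_t.sum()

# Context vector v_t' = attention-weighted sum of encoder outputs
context = a_t @ V                                    # shape (H,)
```

The context vector would then be combined with the decoder's previous output before the next decoding step, as the pipeline above describes.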
  26. Developed approach: model’s I/O and summarization process
      Input: the CNN feature vectors of the (sampled) video frames
      Output: frame-level importance scores
      Summarization process:
        CNN features pass through the linear compression layer and the frame selector, which computes importance scores at the frame level
        Given a video segmentation (using KTS [18]), fragment-level importance scores are calculated by averaging the scores of each fragment’s frames
        The summary is created by selecting the fragments that maximize the total importance score, provided that the summary length does not exceed 15% of the video duration, by solving the 0/1 Knapsack problem
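The fragment-selection stage above can be sketched with a small example: frame scores are averaged per fragment, and fragments are chosen by a textbook 0/1 Knapsack over fragment durations with a 15% budget. The scores and fragment bounds are toy data, not from either dataset.

```python
def knapsack_select(values, lengths, budget):
    """Classic 0/1 Knapsack via dynamic programming over integer lengths."""
    n = len(values)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if lengths[i - 1] <= c:
                dp[i][c] = max(dp[i][c],
                               dp[i - 1][c - lengths[i - 1]] + values[i - 1])
    chosen, c = [], budget            # backtrack to recover the selection
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)

# Toy video: 40 frames, 5 KTS-style fragments given as frame-index ranges
frame_scores = [0.1] * 8 + [0.9] * 6 + [0.3] * 12 + [0.6] * 6 + [0.2] * 8
fragments = [(0, 8), (8, 14), (14, 26), (26, 32), (32, 40)]
frag_scores = [sum(frame_scores[a:b]) / (b - a) for a, b in fragments]
frag_lengths = [b - a for a, b in fragments]

budget = int(0.15 * len(frame_scores))         # 15% of the video duration
selected = knapsack_select(frag_scores, frag_lengths, budget)
```

With these toy numbers only the highest-scoring fragment that fits the 6-frame budget is kept.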
  27. Experiments: datasets
      SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark): 25 videos capturing multiple events (e.g. cooking and sports); video length: 1.6 to 6.5 min; annotation: fragment-based video summaries
      TVSum (https://github.com/yalesong/tvsum): 50 videos from 10 categories of the TRECVid MED task; video length: 1 to 5 min; annotation: frame-level importance scores
  28. Experiments: evaluation protocol
      The generated summary should not exceed 15% of the video length
      Similarity between the automatically generated (A) and the ground-truth (G) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring their temporal overlap (∩) at the frame level (|| . || denotes duration)
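The frame-level metrics above can be written out concretely. Summaries are represented here as sets of selected frame indices, which is a simplification of the duration-based notation on the slide; the toy summaries are invented for the example.

```python
def f_score(A, G):
    """F-Score (%) from the temporal overlap of summaries A and G."""
    overlap = len(A & G)                 # |A ∩ G|, in frames
    if not A or not G or overlap == 0:
        return 0.0
    precision = overlap / len(A)         # P = |A ∩ G| / |A|
    recall = overlap / len(G)            # R = |A ∩ G| / |G|
    return 2 * precision * recall / (precision + recall) * 100

A = set(range(0, 10)) | set(range(30, 40))   # generated summary frames
G = set(range(5, 15)) | set(range(30, 38))   # ground-truth summary frames
score = f_score(A, G)                        # ≈ 68.4 for this toy example
```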
  29. Experiments: evaluation protocol
      Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
      Most-used approach (by [1, 6, 7, 8, 14, 20, 21, 26, 27, 29, 30, 31, 32, 33]): an F-Score is computed against each of the N available user summaries (F-Score1, F-Score2, …, F-ScoreN), and the N per-video results are then aggregated differently for SumMe and TVSum
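The multi-user aggregation hinted at above can be sketched in two lines. Note the exact per-dataset rule is an assumption here: the convention commonly reported in this literature is to keep the maximum F-Score over users for SumMe and the average over users for TVSum, but the deck itself does not spell this out.

```python
# Toy per-video F-Scores against each of N = 3 user summaries
per_user_f = [45.0, 52.5, 48.0]

summe_style = max(per_user_f)                      # best-matching user summary
tvsum_style = sum(per_user_f) / len(per_user_f)    # average over user summaries
```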
  35. Experiments: evaluation protocol
      Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
      Alternative approach (used in [9, 13, 16, 24, 25, 28]): a single F-Score is computed against a single ground-truth summary
  37. Experiments: implementation details
      Videos were down-sampled to 2 fps
      Feature extraction was based on the pool5 layer of GoogleNet trained on ImageNet
      The linear compression layer reduces the size of these vectors from 1024 to 500
      All components are 2-layer LSTMs with 500 hidden units; the frame selector is a bi-directional LSTM
      Training is based on the Adam optimizer; the Summarizer’s learning rate is 10^-4 and the Discriminator’s is 10^-5
      The dataset was split into two non-overlapping sets: a training set with 80% of the data and a testing set with the remaining 20%
      Experiments were run on 5 differently created random splits, and the average performance at the training-epoch level (i.e. for the same training epoch) over these runs is reported
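The split procedure described above can be sketched as follows; the seed and helper name are illustrative, and 50 is used only because it matches the TVSum video count.

```python
import random

def make_splits(n_videos, n_splits=5, train_frac=0.8, seed=0):
    """Create several random, non-overlapping 80/20 train/test partitions."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        idx = list(range(n_videos))
        rng.shuffle(idx)                 # a fresh random permutation per split
        cut = int(train_frac * n_videos)
        splits.append((idx[:cut], idx[cut:]))
    return splits

splits = make_splits(50)                 # 5 splits of 40 train / 10 test videos
```

Performance would then be averaged over the five runs at the same training epoch, as the slide describes.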
  38. Experiments: Step 1, assessing the impact of the regularization factor σ (SUM-GAN-sl, SUM-GAN-VAAE, SUM-GAN-AAE)
      Outcomes:
        The value of σ affects the models’ performance and needs fine-tuning
        This fine-tuning is dataset-dependent
        The best overall performance of each model is observed for a different σ value
  40. Experiments: Step 2, comparison with SoA unsupervised approaches based on multiple user summaries (+/- indicate better/worse performance compared to SUM-GAN-AAE; SUM-GAN is not listed in this table as it follows the single ground-truth-summary evaluation protocol)
      Outcomes:
        A few SoA methods are comparable with (or even worse than) a random summary generator
        The best method on TVSum shows random-level performance on SumMe
        The best method on SumMe performs worse than SUM-GAN-AAE and is less competitive on TVSum
        Variational attention reduces the efficiency of SUM-GAN-sl, due to the difficulty of defining two latent spaces in parallel with the continuous update of the model’s components during training
        Replacing the VAE with the AAE leads to a noticeable performance improvement over SUM-GAN-sl
  41. Experiments: Step 3, evaluating the effect of the introduced AAE component
      Key-fragment selection: the attention mechanism leads to a much smoother series of importance scores
  42. Experiments: Step 3, evaluating the effect of the introduced AAE component
      Training efficiency: much faster and more stable training of the model (loss curves for SUM-GAN-sl and SUM-GAN-AAE)
  43. Experiments: Step 4, comparison with SoA supervised approaches based on multiple user summaries
      Outcomes:
        The best methods on TVSum (MAVS and Tessellation_sup, respectively) seem adapted to this dataset, as they exhibit random-level performance on SumMe
        Only a few supervised methods surpass the performance of a random summary generator on both datasets, with VASNet being the best among them
        The performance of these methods ranges between 44.1 and 49.7 on SumMe, and between 56.1 and 61.4 on TVSum
        The unsupervised SUM-GAN-AAE model is comparable with SoA supervised methods
  44. Experiments: Step 5, comparison with SoA approaches based on single ground-truth summaries
      Impact of the regularization factor σ (best scores in bold):
        The model’s performance is affected by the value of σ
        The effect of σ depends (also) on the evaluation approach; the best performance when using multiple human summaries was observed for σ = 0.15
        SUM-GAN-AAE outperforms the original SUM-GAN model on both datasets, even for the same value of σ
  45. Experiments: Step 5, comparison with SoA approaches based on single ground-truth summaries (unsupervised approaches marked with an asterisk)
      Outcomes:
        The SUM-GAN-AAE model performs consistently well on both datasets
        SUM-GAN-AAE shows improved performance compared to SoA supervised and unsupervised (*) summarization methods
  46. Summarization example: full video vs. generated summary
  47. Conclusions
      Presented a video summarization method that combines:
        the effectiveness of attention mechanisms in spotting the most important parts of the video
        the learning efficiency of generative adversarial networks for unsupervised training
      Experimental evaluations on two benchmark datasets:
        documented the positive contribution of the introduced attention auto-encoder component to the model’s training and summarization performance
        highlighted the competitiveness of the unsupervised SUM-GAN-AAE method against SoA video summarization techniques
  48. Key references
      1. E. Apostolidis, et al.: A stepwise, label-based approach for improving the adversarial training in unsupervised video summarization. In: AI4TV, ACM MM 2019
      2. E. Apostolidis, et al.: Fast shot segmentation combining global and local visual descriptors. In: IEEE ICASSP 2014. pp. 6583-6587
      3. K. Apostolidis, et al.: A motion-driven approach for fine-grained temporal segmentation of user-generated videos. In: MMM 2018. pp. 29-41
      4. H. Bahuleyan, et al.: Variational attention for sequence-to-sequence models. In: 27th COLING. pp. 1672-1682 (2018)
      5. J. Cho: PyTorch implementation of SUM-GAN (2017), https://github.com/j-min/Adversarial_Video_Summary (last accessed on Oct. 18, 2019)
      6. M. Elfeki, et al.: Video summarization via actionness ranking. In: IEEE WACV 2019. pp. 754-763
      7. J. Fajtl, et al.: Summarizing videos with attention. In: ACCV 2018. pp. 39-54
      8. L. Feng, et al.: Extractive video summarizer with memory augmented neural networks. In: ACM MM 2018. pp. 976-983
      9. T. Fu, et al.: Attentive and adversarial learning for video summarization. In: IEEE WACV 2019. pp. 1579-1587
      10. M. Gygli, et al.: Creating summaries from user videos. In: ECCV 2014. pp. 505-520
      11. M. Gygli, et al.: Video summarization by learning submodular mixtures of objectives. In: IEEE CVPR 2015. pp. 3090-3098
      12. S. Hochreiter, et al.: Long Short-Term Memory. Neural Computation 9(8), 1735-1780 (1997)
      13. Z. Ji, et al.: Video summarization with attention-based encoder-decoder networks. IEEE Trans. on Circuits and Systems for Video Technology (2019)
      14. D. Kaufman, et al.: Temporal Tessellation: A unified approach for video analysis. In: IEEE ICCV 2017. pp. 94-104
      15. S. Lee, et al.: A memory network approach for story-based temporal summarization of 360 videos. In: IEEE CVPR 2018. pp. 1410-1419
      16. B. Mahasseni, et al.: Unsupervised video summarization with adversarial LSTM networks. In: IEEE CVPR 2017. pp. 2982-2991
      17. M. Otani, et al.: Video summarization using deep semantic features. In: ACCV 2016. pp. 361-377
  49. Key references (continued)
      18. D. Potapov, et al.: Category-specific video summarization. In: ECCV 2014. pp. 540-555
      19. A. Radford, et al.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR 2016
      20. M. Rochan, et al.: Video summarization by learning from unpaired data. In: IEEE CVPR 2019
      21. M. Rochan, et al.: Video summarization using fully convolutional sequence networks. In: ECCV 2018. pp. 358-374
      22. Y. Song, et al.: TVSum: Summarizing web videos using titles. In: IEEE CVPR 2015. pp. 5179-5187
      23. C. Szegedy, et al.: Going deeper with convolutions. In: IEEE CVPR 2015. pp. 1-9
      24. H. Wei, et al.: Video summarization via semantic attended networks. In: AAAI 2018. pp. 216-223
      25. L. Yuan, et al.: Cycle-SUM: Cycle-consistent adversarial LSTM networks for unsupervised video summarization. In: AAAI 2019. pp. 9143-9150
      26. Y. Yuan, et al.: Video summarization by learning deep side semantic embedding. IEEE Trans. on Circuits and Systems for Video Technology 29(1), 226-237 (2019)
      27. K. Zhang, et al.: Video summarization with Long Short-Term Memory. In: ECCV 2016. pp. 766-782
      28. Y. Zhang, et al.: DTR-GAN: Dilated temporal relational adversarial network for video summarization. In: ACM TURC 2019. pp. 89:1-89:6
      29. Y. Zhang, et al.: Unsupervised object-level video summarization with online motion auto-encoder. Pattern Recognition Letters (2018)
      30. B. Zhao, et al.: Hierarchical recurrent neural network for video summarization. In: ACM MM 2017. pp. 863-871
      31. B. Zhao, et al.: HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. In: IEEE/CVF CVPR 2018. pp. 7405-7414
      32. K. Zhou, et al.: Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: AAAI 2018. pp. 7582-7589
      33. K. Zhou, et al.: Video summarisation by classification with deep reinforcement learning. In: BMVC 2018
  50. Thank you for your attention! Questions?
      Vasileios Mezaris, bmezaris@iti.gr
      Code and documentation publicly available at: https://github.com/e-apostolidis/SUM-GAN-AAE
      This work was supported by the EU’s Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV. The work of Ioannis Patras has been supported by EPSRC under grant No. EP/R026424/1.
