Tutorial on "Video Summarization and Re-use Technologies and Tools", delivered at IEEE ICME 2020. These slides correspond to the first part of the tutorial, presented by Vasileios Mezaris and Evlampios Apostolidis. This part deals with automatic video summarization, and includes a presentation of the video summarization problem definition and a literature overview; an in-depth discussion on a few unsupervised GAN-based methods; and a discussion on video summarization datasets, evaluation protocols and results, and future directions.
1. Video Summarization and Re-use Technologies and Tools
Part I: Automatic video summarization
Section I.1: Video summarization problem definition and literature overview
Vasileios Mezaris, Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
2. Tutorial's structure and time schedule
Part I: Automatic video summarization
Section I.1: Video summarization problem definition and literature overview (20')
Q&A (5')
Section I.2: In-depth discussion on a few unsupervised GAN-based methods (20')
Q&A (5')
Section I.3: Datasets, evaluation protocols and results, and future directions (20')
20' Q&A and break; then we are back with the tutorial's Part II: Video summaries re-use and recommendation
3. Problem definition
Video is everywhere!
Hours of video content are uploaded on YouTube every minute
Video is captured by smart devices and instantly shared online
Constantly and rapidly increasing volumes of video content
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-sharing-apps-like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
4.-5. Problem definition - video consumption side
But how do we find what we are looking for in endless collections of video content?
Quickly inspect a video's content by checking its synopsis!
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
6.-7. Problem definition - video editing side
But how do we reach different audiences for a given media item?
Use of technologies for content adaptation, re-use and re-purposing!
(Image: different viewers reacting to the same content: "Good", "Very interesting", "Boring", "Nice", "Much detailed")
Image source: https://marketingland.com/social-media-audience-critical-content-marketing-223647
8. Problem definition
Video summary: a short visual summary that encapsulates the flow of the story and the essential parts of the full-length video
Example: original video vs. video summary (storyboard)
Source: https://www.youtube.com/watch?v=deRF9oEbRso
9. Problem definition
General applications of video summarization
Professional CMS: effective indexing, browsing, retrieval & promotion of media assets!
Video sharing platforms: improved viewer experience, enhanced viewer engagement & increased content consumption!
Sources: https://www.redbytes.in/how-to-build-an-app-like-hotstar/ & screenshot of the BBC News channel on YouTube
10. Problem definition
General applications of video summarization
Audience- and channel-specific content adaptation: video content re-use and re-distribution in the most appropriate way!
Image source: https://www.databagg.com/online-video-sharing
11. Problem definition
Domain-specific applications of video summarization
Full movie (e.g. 1h 30'-2h) vs. movie trailer (2'30'')
J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, "Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation," in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM '17. New York, NY, USA: ACM, 2017, pp. 1799-1808.
Source: https://www.youtube.com/watch?v=wb49-oV0F78
12. Problem definition
Domain-specific applications of video summarization
Full game (e.g. 1h 30') vs. game's synopsis & highlights (1'32'')
Source: https://www.youtube.com/watch?v=oo-2IFTifUU
13. Problem definition
Domain-specific applications of video summarization
Raw CCTV material (e.g. 24h) vs. summary of important actions/events (with timestamps)
Video samples extracted from: https://www.youtube.com/watch?v=gk3qTMlcadk
14. Literature overview
Taxonomy of deep-learning-based methods for automatic video summarization
15. Literature overview
Supervised approaches: using video semantics and metadata
[Zhang, 2016; Kaufman, 2017] learn and transfer the summary structure of semantically-similar videos
[Panda, 2017] metadata-driven video categorization and summarization by maximizing relevance with the video category
[Song, 2016; Zhou, 2018a] category-driven summarization by category feature preservation (keep the main parts of a wedding when summarizing a wedding video)
[Otani, 2016; Yuan, 2019] maximize relevance of visual (video) and textual (metadata) data in a common latent space
16. Literature overview
Supervised approaches: considering temporal structure and dependency
[Zhang, 2016b] estimates frames' importance by modeling their variable-range temporal dependency using RNNs (see the sketch below)
[Zhao, 2018] models and encodes the temporal structure of the video for defining the key-fragments, using hierarchies of RNNs
[Ji, 2019] treats video-to-summary generation as a sequence-to-sequence learning problem, using an attention-driven encoder-decoder network
[Feng, 2018; Wang, 2019] estimate frames' importance by modeling their long-range dependency using high-capacity memory networks
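As a rough illustration of the RNN-based modeling cited above (e.g. [Zhang, 2016b]), the following sketch shows a bidirectional LSTM that maps pre-extracted frame features to per-frame importance scores. The layer sizes and the training data are hypothetical; this does not reproduce the exact architecture of any cited paper.

import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Bidirectional LSTM that scores each frame's importance in [0, 1]."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        # The LSTM models temporal dependencies among the CNN-extracted frame features.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, feats):            # feats: (batch, n_frames, feat_dim)
        h, _ = self.lstm(feats)          # (batch, n_frames, 2 * hidden)
        return self.head(h).squeeze(-1)  # per-frame importance scores

# Supervised training step against (hypothetical) human importance annotations.
scorer = FrameScorer()
feats = torch.randn(2, 120, 1024)        # CNN features of 2 videos, 120 frames each
targets = torch.rand(2, 120)             # ground-truth frame-importance scores
loss = nn.functional.mse_loss(scorer(feats), targets)
loss.backward()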
17. Literature overview
Supervised approaches: imitating human summaries
[Zhang, 2019] summarization by confusing a trainable discriminator when making the distinction between a machine- and a human-generated summary; models the variable-range temporal dependency using RNNs and Dilated Temporal Units
[Fu, 2019] key-fragment selection by confusing a trainable discriminator when making the distinction between machine- and human-selected key-fragments; fragmentation based on an attention-based Pointer Network, and discrimination using a 3D-CNN classifier
18. Literature overview
Supervised approaches: targeting specific properties of the summary
[Chu, 2019] models spatiotemporal information based on raw frames and optical flow maps, and learns frames' importance from human annotations via a label distribution learning process
[Elfeki, 2019] uses CNNs and RNNs to form spatiotemporal feature vectors, and estimates the level of activity and importance of each frame to create the summary
[Chen, 2019] summarization based on reinforcement learning and reward functions associated with the diversity and representativeness of the video summary
19. Literature overview
Unsupervised approaches: inferring the original video
[Mahasseni, 2017] SUM-GAN trains a summarizer to fool a discriminator when distinguishing the original from the summary-based reconstructed video, using adversarial learning
[Jung, 2019] CSNet extends [Mahasseni, 2017] with a chunk-and-stride network and an attention mechanism to assess variable-range dependencies and select the video keyframes
[Apostolidis, 2020] SUM-GAN-AAE extends [Mahasseni, 2017] with a stepwise, fine-grained training strategy and an attention auto-encoder to improve the key-fragment selection process
[Rochan, 2019] UnpairedVSN learns video summarization from unpaired data, based on an adversarial process that defines a mapping function from a raw video to a human summary
20. Literature overview
Unsupervised approaches: targeting specific properties of the summary
[Zhou, 2018b] DR-DSN learns to create representative and diverse summaries via reinforcement learning and relevant reward functions (sketched below)
[Gonuguntla, 2019] EDSN extracts spatiotemporal information and learns summarization by rewarding the maintenance of the main spatiotemporal patterns in the summary
[Zhang, 2018] OnlineMotionAE extracts the key motions of appearing objects and uses an online motion auto-encoder model to generate summaries that include the main objects in the video and the attractive actions made by each of these objects
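To make the reward-driven training concrete, here is a minimal sketch of diversity and representativeness rewards in the spirit of DR-DSN [Zhou, 2018b]: diversity rewards dissimilarity among the selected frames, representativeness rewards how well they cover the whole video. The paper's exact formulations differ in details, and all names here are ours.

import torch

def diversity_reward(feats, picks):
    # Mean pairwise dissimilarity (1 - cosine similarity) among selected frames.
    sel = torch.nn.functional.normalize(feats[picks], dim=1)
    sim = sel @ sel.t()
    n = sel.size(0)
    return (1.0 - sim[~torch.eye(n, dtype=torch.bool)]).mean()

def representativeness_reward(feats, picks):
    # Rewards selections whose frames lie close to every frame of the video
    # (a k-medoids-like coverage criterion).
    dists = torch.cdist(feats, feats[picks])   # (n_frames, n_selected)
    return torch.exp(-dists.min(dim=1).values.mean())

feats = torch.randn(120, 1024)                 # per-frame deep features
picks = torch.tensor([3, 40, 77, 110])         # frame indices chosen by the agent
reward = diversity_reward(feats, picks) + representativeness_reward(feats, picks)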
21. Some concluding remarks
DL-based video summarization methods mainly rely on combinations of CNNs and RNNs
Pre-trained CNNs are used to represent the visual content; RNNs (mostly LSTMs) are used to model the temporal dependency among video frames (see the feature-extraction sketch below)
The proposed video summarization approaches are mostly supervised
The best supervised approaches utilize tailored attention mechanisms or memory networks to capture variable- and long-range temporal dependencies, respectively
For unsupervised video summarization, GANs are the central direction; RL is another, but less common, approach
The best unsupervised approaches rely on VAE-GAN architectures that have been enhanced with attention mechanisms
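The feature-extraction step mentioned above is typically implemented with an off-the-shelf backbone. The following sketch uses torchvision's GoogLeNet, whose pooled 1024-d features are a common choice in this literature; the frame tensor is placeholder data, and the sampling rate is only an example.

import torch
import torchvision.models as models

# Pre-trained CNN used as a frozen per-frame feature extractor.
backbone = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()        # drop the classifier, keep 1024-d features
backbone.eval()

frames = torch.randn(120, 3, 224, 224)   # e.g. frames sampled at 2 fps from a video
with torch.no_grad():
    feats = backbone(frames)             # (120, 1024): one descriptor per frame
# 'feats' is what the temporal model (LSTM, attention network, ...) then consumes.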
22.-23. Some concluding remarks
The generation of ground-truth data can be an expensive and laborious process
Video summarization is a subjective task, and multiple summaries can be proposed for a video
Human annotations that vary a lot make it hard to train a method with the typical supervised training approaches
Unsupervised video summarization algorithms overcome the need for ground-truth data and can be trained using only an adequately large collection of videos
Unsupervised learning allows training a summarization method on different types of video content (TV shows, news) and then performing content-wise video summarization
Unsupervised video summarization has great advantages, increases the applicability of summarization technologies, and its potential should be investigated
Short break; coming up:
Section I.2: Discussion on a few unsupervised GAN-based methods
The SUM-GAN method [Mahasseni, 2017]
Problem formulation: video summarization via selecting a sparse subset of frames that optimally represent the video
Main idea: learn summarization by minimizing the distance between videos and a distribution of their summarizations
Goal: select a set of keyframes such that a distance between the deep representations of the selected keyframes and the video is minimized
Challenge: how to define a good distance?
Solution: use a Discriminator network and train it with the Summarizer in an adversarial manner
B. Mahasseni, M. Lam and S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks," 2017 IEEE CVPR, Honolulu, HI, 2017, pp. 2982-2991, doi: 10.1109/CVPR.2017.318. (Architecture figure courtesy of Mahasseni et al.)
Training pipeline and loss functions
Deep features of video frames in the Frame Selector => normalized importance scores
Weighted features in the Encoder => latent representation e
Latent representation e in the Decoder => sequence of features for the frames of the input video
Original & reconstructed features in the Discriminator => distance estimation and binary classification as "video" or "summary"
Train the Frame Selector and Encoder by minimizing Lsparsity + Lprior + Lreconst
Train the Decoder by minimizing Lreconst + LGAN
Train the Discriminator by maximizing LGAN
Update all components via backpropagation using Stochastic Gradient Variational Bayes estimation (see the loss-level sketch below)
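To make the division of labor among these objectives concrete, here is a minimal PyTorch-style sketch of how the SUM-GAN loss terms could be computed and assigned to the three component groups. The module interfaces (selector, encoder, decoder, discriminator), the regularization factor sigma, and the feature-space MSE reconstruction loss are simplifying assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def sumgan_losses(x, selector, encoder, decoder, discriminator, sigma=0.3):
    """Loss terms of a SUM-GAN-style model for one video x of shape (T, D).

    The four modules stand in for the paper's LSTM-based components; the
    discriminator is assumed to output a probability in (0, 1).
    """
    scores = selector(x)                        # (T, 1) frame importance scores
    e, mu, logvar = encoder(scores * x)         # VAE over score-weighted features
    x_hat = decoder(e, steps=x.shape[0])        # reconstructed frame features

    l_sparsity = (scores.mean() - sigma).abs()  # keep roughly a sigma fraction of frames
    l_prior = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    l_reconst = F.mse_loss(x_hat, x)            # simplified reconstruction loss
    d_real, d_fake = discriminator(x), discriminator(x_hat)
    l_gan = torch.mean(torch.log(d_real) + torch.log(1.0 - d_fake))

    return {
        "frame_selector_and_encoder": l_sparsity + l_prior + l_reconst,  # minimize
        "decoder": l_reconst + l_gan,                                    # minimize
        "discriminator": l_gan,                                          # maximize
    }
```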
The SUM-GAN method [Mahasseni, 2017]
Inference stage and video summarization
Deep features of video frames in the Frame Selector => normalized frame-level importance scores
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem (see the sketch below)
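A minimal sketch of the last two inference steps, assuming the frame-level scores and the KTS shot boundaries are already available; the function name and the 15% default budget are illustrative:

```python
import numpy as np

def select_key_fragments(frame_scores, shot_boundaries, max_ratio=0.15):
    """Select video fragments under a length budget via a 0/1 knapsack DP.

    frame_scores: (T,) per-frame importance scores from the Frame Selector.
    shot_boundaries: list of (start, end) frame-index pairs produced by KTS.
    max_ratio: summary length budget as a fraction of the video duration.
    """
    values = [float(frame_scores[s:e].mean()) for s, e in shot_boundaries]
    weights = [e - s for s, e in shot_boundaries]       # fragment lengths
    budget = int(max_ratio * len(frame_scores))

    # Classic 0/1 knapsack dynamic program over the fragments
    n = len(values)
    dp = np.zeros((n + 1, budget + 1))
    for i in range(1, n + 1):
        for w in range(budget + 1):
            dp[i, w] = dp[i - 1, w]
            if weights[i - 1] <= w:
                dp[i, w] = max(dp[i, w],
                               dp[i - 1, w - weights[i - 1]] + values[i - 1])

    # Backtrack to recover the selected fragments
    selected, w = [], budget
    for i in range(n, 0, -1):
        if dp[i, w] != dp[i - 1, w]:
            selected.append(shot_boundaries[i - 1])
            w -= weights[i - 1]
    return sorted(selected)
```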
The SUM-GAN-sl method [Apostolidis, 2019]
Builds on the SUM-GAN architecture
Contains a linear compression (LC) layer that reduces the size of the CNN feature vectors
Follows an incremental and fine-grained approach to train the model's components
E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
Training pipeline and loss functions: step-wise training process
Inference stage and video summarization
Deep features of video frames in the LC layer and Frame Selector => normalized frame-level importance scores
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem
The SUM-GAN-AAE method [Apostolidis, 2020]
Builds on the SUM-GAN-sl algorithm
Introduces an attention mechanism by replacing the VAE of SUM-GAN-sl with a deterministic attention auto-encoder
E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp. 492-504, Jan. 2020. Best paper award.
The attention auto-encoder: processing pipeline
Weighted feature vectors are fed to the Encoder
The Encoder's output (V) and the Decoder's previous hidden state are fed to the Attention component:
for t > 1, the hidden state of the previous Decoder step (h_t-1) is used
for t = 1, the hidden state of the last Encoder step (H_e) is used
Attention weights (α_t) are computed using an energy score function and a soft-max function
α_t is multiplied with V to form the context vector v_t'
v_t' is combined with the Decoder's previous output y_t-1
The Decoder gradually reconstructs the video (see the sketch after this list)
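The pipeline above is a Bahdanau-style attention decoder; a minimal PyTorch sketch of a single decoding step follows. The layer sizes, the GRU cell, and the exact way v_t' is combined with y_t-1 are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """One decoding step of an attention auto-encoder (illustrative sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.energy = nn.Linear(2 * dim, 1)    # energy score function
        self.cell = nn.GRUCell(2 * dim, dim)   # decoder recurrent cell (assumed GRU)

    def forward(self, V, h_prev, y_prev):
        # V: (T, dim) Encoder outputs; h_prev: (dim,) previous Decoder hidden
        # state (the Encoder's last hidden state H_e when t = 1);
        # y_prev: (dim,) the Decoder's previous output y_t-1.
        T = V.shape[0]
        e = self.energy(torch.cat([V, h_prev.expand(T, -1)], dim=1))  # (T, 1)
        alpha = torch.softmax(e, dim=0)               # attention weights α_t
        context = (alpha * V).sum(dim=0)              # context vector v_t'
        h = self.cell(torch.cat([context, y_prev]).unsqueeze(0),
                      h_prev.unsqueeze(0)).squeeze(0)
        return h, alpha                               # new hidden state + weights
```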
Training pipeline and loss functions
Training is performed in an incremental way, as in SUM-GAN-sl
No prior loss is used, since the deterministic attention auto-encoder has no latent distribution to regularize
Inference stage and video summarization
Deep features of video frames in the LC layer and Frame Selector => normalized frame-level importance scores
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem
Impact of the introduced attention mechanism
Much smoother series of importance scores
Much faster and more stable training of the model
[Figures: loss curves for SUM-GAN-sl and SUM-GAN-AAE; average (over 5 splits) learning curves of SUM-GAN-sl and SUM-GAN-AAE on SumMe]
Some concluding remarks
Using GANs for video summarization:
GANs are the most common strategy for learning summarization in an unsupervised way
They offer a mechanism to build a representative summary, by making the summary-based reconstruction of the video as close as possible to the original
Summarization performance is superior to other unsupervised learning approaches (e.g. reinforcement learning) and comparable to a few supervised learning methods
Step-wise training facilitates the training of complex GAN-based architectures
The introduction of attention mechanisms is beneficial to the quality of the created summary
There is room for further improving GAN-based unsupervised video summarization via: a) combination with reinforcement learning approaches, b) extension with memory networks
Short break; coming up:
Section I.3: Datasets, evaluation protocols and results, and future directions
Datasets
Most commonly used:
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
25 videos capturing multiple events (e.g. cooking and sports)
video length: 1 to 6 min
annotation: fragment-based video summaries (15-18 per video)
TVSum (https://github.com/yalesong/tvsum)
50 videos from 10 categories of the TRECVid MED task
video length: 1 to 11 min
annotation: frame-level importance scores (20 per video)
Less commonly used:
Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
50 videos of various genres (e.g. documentary, educational, historical, lecture)
video length: 1 to 4 min
annotation: keyframe-based video summaries (5 per video)
YouTube (https://sites.google.com/site/vsummsite/download)
50 videos of diverse content (e.g. cartoons, news, sports, commercials) collected from websites
video length: 1 to 10 min
annotation: keyframe-based video summaries (5 per video)
Evaluation protocols
Early approach
Agreement between an automatically-created (A) and a user-defined (U) summary is expressed by an F-Score computed over matched pairs of frames
Matching of a pair of frames is based on color histograms, the Manhattan distance and a predefined similarity threshold
80% of the video samples are used for training and the remaining 20% for testing
The final evaluation outcome is obtained by:
computing the average F-Score for a test video, given the different user summaries for this video
computing the average of the calculated F-Score values over the different test videos
Established approach
The generated summary should not exceed 15% of the video length
Agreement between an automatically-generated (A) and a user-defined (U) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring their temporal overlap (∩) at the frame level (|| · || denotes duration)
80% of the video samples are used for training and the remaining 20% for testing
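The Precision, Recall and F-Score definitions referenced above, reconstructed here in LaTeX from the slide's description (these are the standard frame-level formulas in the video summarization literature):

```latex
P = \frac{\|A \cap U\|}{\|A\|}, \qquad
R = \frac{\|A \cap U\|}{\|U\|}, \qquad
F\text{-Score} = 2 \cdot \frac{P \cdot R}{P + R} \times 100\%
```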
Established approach - a side note
TVSum annotations need conversion from frame-level importance scores to key-fragments:
human annotations in TVSum: frame-level importance scores
video fragmentation using KTS
fragment-level importance scores
key-fragment selection as a Knapsack problem
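This conversion is essentially the inference pipeline applied to the human annotations; reusing the illustrative select_key_fragments sketch from the inference slide (user_scores and kts_boundaries are hypothetical variables holding one annotator's scores and the KTS output):

```python
# Turn one TVSum annotator's frame-level scores into a key-fragment summary
user_fragments = select_key_fragments(user_scores, kts_boundaries, max_ratio=0.15)
```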
Evaluation protocols
Established approach
Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
Most used approach: the automatic summary is compared against each of the N user summaries, yielding F-Score_1, F-Score_2, ..., F-Score_N, which are then aggregated per dataset (see the evaluation sketch below):
SumMe: F-Score = max{F-Score_i}, i = 1..N
TVSum: F-Score = mean{F-Score_i}, i = 1..N
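A minimal sketch of this evaluation protocol, assuming the automatic and user summaries are given as binary frame-selection vectors; the function names are illustrative:

```python
import numpy as np

def fscore(auto, user):
    """F-Score (%) between two binary frame-selection vectors."""
    overlap = np.logical_and(auto, user).sum()
    if overlap == 0:
        return 0.0
    p, r = overlap / auto.sum(), overlap / user.sum()
    return 2 * p * r / (p + r) * 100

def evaluate(auto, user_summaries, dataset="TVSum"):
    """Aggregate the per-user F-Scores: max for SumMe, mean for TVSum."""
    scores = [fscore(auto, u) for u in user_summaries]
    return max(scores) if dataset == "SumMe" else float(np.mean(scores))
```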
Evaluation protocols
Established approach
Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
Alternative approach: a single ground-truth summary is formed from the available user summaries, and a single F-Score is computed against it
Results: comparison of unsupervised methods
General remarks:
Best-performing unsupervised methods rely on Generative Adversarial Networks
The use of attention mechanisms allows the identification of the important parts of the video
The best method on TVSum (Tesselation) appears to be dataset-tailored, as it shows random-level performance on SumMe
The use of rewards and reinforcement learning is less competitive than the use of GANs
A few methods show random-level performance on at least one of the used datasets

Method            SumMe FSc (Rnk)   TVSum FSc (Rnk)   AVG Rnk
Random summary    40.2 (10)         54.4 (9)           9.5
Online Motion AE  37.7 (11)         51.5 (11)         11
SUM-FCNunsup      41.5 (8)          52.7 (10)          9
DR-DSN            41.4 (9)          57.6 (6)           7.5
EDSN              42.6 (7)          57.3 (7)           7
UnpairedVSN       47.5 (4)          55.6 (8)           6
PCDL              42.7 (6)          58.4 (4)           5
ACGAN             46.0 (5)          58.5 (3)           4
Tesselation       41.4 (7)          64.1 (1)           4
SUM-GAN-sl        47.8 (3)          58.4 (4)           3.5
SUM-GAN-AAE       48.9 (2)          58.3 (5)           3.5
CSNet             51.3 (1)          58.8 (2)           1.5
Quantitative comparison
Video #15 of TVSum: "How to Clean Your Dog's Ears - Vetoquinol USA"
Use of video summarization technologies
Tool for content adaptation / re-purposing
Developed by CERTH-ITI
Employs the GAN-based methods for unsupervised learning of [Apostolidis 2019, 2020]
Enables content adaptation for distribution via multiple communication channels
Facilitates summary creation based on the audience needs for: Twitter, Facebook (feed & stories), Instagram (feed & stories), YouTube, TikTok
E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp. 492-504, Jan. 2020.
Learns content-specific summarization
Separate models can be trained and used for different video content (e.g. TV shows)
Creating these models does not require manually-generated training data (it's (almost) for free)
Tool for content adaptation / re-purposing
Try it with your video at: http://multimedia2.iti.gr/videosummarization/service/start.html
Demo video: https://youtu.be/LbjPLJzeNII
Future directions
Analysis-oriented:
Unsupervised video summarization based on combining adversarial and reinforcement learning
Advanced attention mechanisms and memory networks for capturing long-range temporal dependencies among parts of the video
Exploiting augmented/extended training data
Introducing editorial rules in unsupervised video summarization
Examining the potential of transfer learning in video summarization
Application-oriented:
There is a lack of integrated technologies for automating video summarization; CERTH's web application is one of the first complete tools
Automated summarization that adapts to the distribution channel, the targeted audience or the video content has a strong potential!
Further applications of video summarization should be investigated by:
monitoring the modern media / social media ecosystem
identifying new application domains for content adaptation / re-purposing
translating the needs of these application domains into analysis requirements
Key references
[Apostolidis, 2019] E. Apostolidis, A. I. Metsai, E. Adamantidou, V. Mezaris, and I. Patras, “A stepwise, label-based approach for
improving the adversarial training in unsupervised video summarization,” in Proc. of the 1st Int. Workshop on AI for Smart TV
Content Production, Access and Delivery, ser. AI4TV ’19. New York, NY, USA: ACM, 2019, pp. 17–25.
[Apostolidis, 2020] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, “Unsupervised video summarization via
attention-driven adversarial learning,” in Proc. of the Int. Conf. on Multimedia Modeling. Springer, 2020, pp. 492–504.
[Bahdanau, 2015] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in
Proc. of the 3rd Int. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
[Chen 2019] Y. Chen, L. Tao, X. Wang, and T. Yamasaki, “Weakly supervised video summarization by hierarchical reinforcement
learning,” in Proc. of the ACM Multimedia Asia, 2019, pp. 1–6.
[Cho, 2014] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–
decoder approaches,” in Proc. of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111.
[Chu, 2019] W.-T. Chu and Y.-H. Liu, “Spatiotemporal modeling and label distribution learning for video summarization,” in Proc.
of the 2019 IEEE 21st Int. Workshop on Multimedia Signal Processing (MMSP). IEEE, 2019, pp. 1–6.
[Elfeki, 2019] M. Elfeki and A. Borji, “Video summarization via actionness ranking,” in Proc. of the IEEE Winter Conference on
Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, Jan 2019, pp. 754–763.
[Fajtl, 2019] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, “Summarizing videos with attention,” in Asian
Conf. on Computer Vision (ACCV) 2019 Workshops, G. Carneiro and S. You, Eds. Cham: Springer International Publishing,
2019, pp. 39–54.
[Feng, 2018] L. Feng, Z. Li, Z. Kuang, and W. Zhang, “Extractive video summarizer with memory augmented neural networks,” in
Proc. of the 26th ACM Int. Conf. on Multimedia, ser. MM ’18. New York, NY, USA: ACM, 2018, pp. 976–983.
[Fu, 2019] T. Fu, S. Tai, and H. Chen, “Attentive and adversarial learning for video summarization,” in Proc. of the IEEE Winter
Conf. on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, pp. 1579–1587.
[Gonuguntla, 2019] N. Gonuguntla, B. Mandal, N. Puhan et al., “Enhanced deep video summarization network,” in Proc. of the
2019 British Machine Vision Conference (BMVC), 2019.
[Goyal, 2017] A. Goyal, N. R. Ke, A. Lamb, R. D. Hjelm, C. J. Pal, J. Pineau, and Y. Bengio, “Actual: Actor-critic under adversarial
learning,” ArXiv, vol. abs/1711.04755, 2017.
[Gygli, 2014] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in Proc. of the
European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer
International Publishing, 2014, pp. 505–520.
[Gygli, 2015] M. Gygli, H. Grabner, and L. V. Gool, “Video summarization by learning submodular mixtures of objectives,” in Proc.
of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3090–3098.
[Haarnoja, 2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor,” in Proc. of the 35th Int. Conf. on Machine Learning (ICML), 2018.
[He, 2019] X. He, Y. Hua, T. Song, Z. Zhang, Z. Xue, R. Ma, N. Robertson, and H. Guan, “Unsupervised video summarization with
attentive conditional generative adversarial networks,” in Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New
York, NY, USA: ACM, 2019, pp. 2296–2304.
[Hochreiter, 1997] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–
1780, 1997.
[Huang, 2020] C. Huang and H. Wang, “A novel key-frames selection framework for comprehensive video summarization,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 30, no. 2, pp. 577–589, 2020.
[Ji, 2019] Z. Ji, K. Xiong, Y. Pang, and X. Li, “Video summarization with attention-based encoder-decoder networks,” IEEE
Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019.
[Jung, 2019] Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon, “Discriminative feature learning for unsupervised video
summarization,” in Proc. of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8537–8544.
[Kaufman, 2017] D. Kaufman, G. Levi, T. Hassner, and L. Wolf, “Temporal tessellation: A unified approach for video analysis,” in
Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 94–104.
[Kulesza, 2012] A. Kulesza and B. Taskar, Determinantal Point Processes for Machine Learning. Hanover, MA, USA: Now
Publishers Inc., 2012.
[Lal, 2019] S. Lal, S. Duggal, and I. Sreedevi, “Online video summarization: Predicting future to better summarize present,” in
Proc. of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 471–480.
[Lebron Casas, 2019] L. Lebron Casas and E. Koblents, “Video summarization with LSTM and deep attention models,” in
MultiMedia Modeling, I. Kompatsiaris, B. Huet, V. Mezaris, C. Gurrin, W.-H. Cheng, and S. Vrochidis, Eds. Cham: Springer
International Publishing, 2019, pp. 67–79.
[Liu, 2019] Y.-T. Liu, Y.-J. Li, F.-E. Yang, S.-F. Chen, and Y.-C. F. Wang, “Learning hierarchical self-attention for video
summarization,” in Proc. of the 2019 IEEE Int. Conf. on Image Processing (ICIP). IEEE, 2019, pp. 3377–3381.
[Mahasseni, 2017] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial LSTM
networks,” in Proc. of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2982–
2991.
[Otani, 2016] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, "Video summarization using deep semantic features," in Proc. of the 13th Asian Conference on Computer Vision (ACCV'16), 2016.
[Panda, 2017] R. Panda, A. Das, Z. Wu, J. Ernst, and A. K. Roy-Chowdhury, “Weakly supervised summarization of web videos,” in
Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 3677–3686.
[Pfau, 2016] D. Pfau and O. Vinyals, “Connecting generative adversarial networks and actor-critic methods,” in NIPS Workshop
on Adversarial Training, 2016.
[Potapov, 2014] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in Proc. of the
European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer
International Publishing, 2014, pp. 540–555.
[Rochan, 2018] M. Rochan, L. Ye, and Y. Wang, “Video summarization using fully convolutional sequence networks,” in Proc. of
the European Conference on Computer Vision (ECCV) 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham:
Springer International Publishing, 2018, pp. 358–374.
[Rochan, 2019] M. Rochan and Y. Wang, “Video summarization by learning from unpaired data,” in Proc. of the 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[Savioli, 2019] N. Savioli, “A hybrid approach between adversarial generative networks and actor-critic policy gradient for low
rate high-resolution image compression,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019.
[Smith, 2017] J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, “Harnessing A.I. for Augmenting Creativity: Application to Movie
Trailer Creation,” in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1799–
1808.
[Song, 2015] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web videos using titles," in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 5179–5187.
[Song, 2016] X. Song, K. Chen, J. Lei, L. Sun, Z. Wang, L. Xie, and M. Song, “Category driven deep recurrent neural network for
video summarization,” in Proc. of the 2016 IEEE Int. Conf. on Multimedia Expo Workshops (ICMEW), July 2016, pp. 1–6.
[Szegedy, 2015] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A.
Rabinovich, “Going deeper with convolutions,” in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2015, pp. 1–9.
[Vinyals, 2015] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in Advances in Neural Information Processing Systems
28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2692–2700.
[Wang, 2019] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, and T. Tan, “Stacked memory network for video summarization,” in
Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New York, NY, USA: ACM, 2019, pp. 836–844.
[Wang, 2016] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good
practices for deep action recognition,” in Proc. of the European Conference on Computer Vision – ECCV 2016, B. Leibe, J.
Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 20–36.
[Wei, 2018] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao, “Video summarization via semantic attended networks,” in Proc. of
the 2018 AAAI Conf. on Artificial Intelligence (AAAI), 2018.
[Yu, 2017] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence generative adversarial nets with policy gradient,” in Proc. of
the 2017 AAAI Conf. on Artificial Intelligence, ser. (AAAI). AAAI Press, 2017, pp. 2852–2858.
[Yuan, 2019a] L. Yuan, F. E. H. Tay, P. Li, L. Zhou, and J. Feng, “Cycle-SUM: Cycle-consistent adversarial lstm networks for
unsupervised video summarization,” in Proc. of the 2019 AAAI Conf. on Artificial Intelligence (AAAI), 2019.
[Yuan, 2019b] Y. Yuan, T. Mei, P. Cui, and W. Zhu, “Video summarization by learning deep side semantic embedding,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 226–237, Jan 2019.
[Yuan, 2019c] Y. Yuan, H. Li, and Q. Wang, "Spatiotemporal modeling for video summarization using convolutional recurrent neural network," IEEE Access, vol. 7, pp. 64676–64685, 2019.
[Zhang, 2016a] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Summary transfer: Exemplar-based subset selection for video
summarization,” in Proc. of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp.
1059–1067.
[Zhang, 2016b] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in Proc. of
the European Conference on Computer Vision (ECCV) 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer
International Publishing, 2016, pp. 766–782.
[Zhang, 2018] Y. Zhang, X. Liang, D. Zhang, M. Tan, and E. P. Xing, “Unsupervised object-level video summarization with online
motion auto-encoder,” Pattern Recognition Letters, 2018.
[Zhang, 2019] Y. Zhang, M. Kampffmeyer, X. Zhao, and M. Tan, “DTR-GAN: Dilated temporal relational adversarial network for
video summarization,” in Proc. of the ACM Turing Celebration Conference - China, ser. ACM TURC ’19. New York, NY, USA:
ACM, 2019, pp. 89:1–89:6.
[Zhao, 2017] B. Zhao, X. Li, and X. Lu, “Hierarchical recurrent neural network for video summarization,” in Proc. of the 2017 ACM
on Multimedia Conference, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 863–871.
[Zhao, 2018] B. Zhao, X. Li, and X. Lu, "HSA-RNN: Hierarchical structure-adaptive RNN for video summarization," in Proc. of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7405–7414.
[Zhao, 2019] B. Zhao, X. Li, and X. Lu, “Property-constrained dual learning for video summarization,” IEEE Transactions on
Neural Networks and Learning Systems, 2019.
[Zhou, 2018a] K. Zhou, T. Xiang, and A. Cavallaro, “Video summarisation by classification with deep reinforcement learning,” in
Proc. of the 2018 British Machine Vision Conference (BMVC), 2018.
[Zhou, 2018b] K. Zhou and Y. Qiao, “Deep reinforcement learning for unsupervised video summarization with diversity-
representativeness reward,” in Proc. of the 2018 AAAI Conference on Artificial Intelligence (AAAI), 2018.
Vasileios Mezaris
bmezaris@iti.gr
Evlampios Apostolidis
apostolid@iti.gr
CERTH-ITI, Greece
info@retv-project.eu
This work has received funding from the
European Union’s Horizon 2020 research
and innovation programme under grant
agreement H2020-780656 ReTV
Questions?
Following the Q&A session and the break, we will be back with Part II of the tutorial, on video summaries re-use and recommendation.