This document is a second progress report on video summarization research. It outlines the topics covered: an introduction to video summarization, a literature review summarizing five papers on the topic, identified research gaps and challenges, the problem statement of finding key frames based on extracted text, an overview of relevant datasets and tools used, and conclusions. The literature review analyzes the objectives, methods, strengths, and limitations of the summarized papers.
"Unsupervised Video Summarization via Attention-Driven Adversarial Learning", by E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras. Proceedings of the 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Jan. 2020.
Presentation of the paper "Explaining video summarization based on the focus of attention", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at IEEE ISM 2022, Dec. 2022, Naples, Italy.
In this paper we propose a method for explaining video summarization. We start by formulating the problem as the creation of an explanation mask which indicates the parts of the video that most influenced the video summarization network's estimates of the frames' importance. Then, we explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation signals, and we examine various attention-based signals that have been studied as explanations in the NLP domain. We evaluate the performance of these signals by investigating the video summarization network's input-output relationship under different replacement functions, and by utilizing measures that quantify the capability of explanations to spot the most and least influential parts of a video. We run experiments using an attention-based network (CA-SUM) and two datasets (SumMe and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our method to explain the video summarization results using clues about the focus of the attention mechanism.
Presentation of the paper titled "Combining Global and Local Attention with Positional Encoding for Video Summarization", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at the IEEE Int. Symposium on Multimedia (ISM), Dec. 2021. The corresponding software is available at https://github.com/e-apostolidis/PGL-SUM.
Presentation given in the Seminar of B.Tech 6th Semester during session 2009-10 By Paramjeet Singh Jamwal, Poonam Kanyal, Rittitka Mittal and Surabhi Tyagi.
These are the subject slides for the module MMS2401 - Multimedia System and Communication, taught at Shepherd College of Media Technology, affiliated with Purbanchal University.
This presentation discusses the basics of video compression, such as DCT, colour space conversion, and motion compensation. It also discusses standards such as H.264, MPEG-2, and MPEG-4.
Topics: what is video compression; introduction and motivation; working methodology of video compression; examples; applications; the need for video compression; advantages and disadvantages.
This presentation is about the JPEG compression algorithm. It briefly describes all the underlying steps in JPEG compression: picture preparation, DCT, quantization, rendering, and encoding.
This presentation covers the following topics-
1. Video Classification as a sequence of frames
2. Video Classification as a sequence of frame-blocks
3. 2D ConvNets for Videos
4. CNN + LSTM
Multimodal video abstraction into a static document using deep learning
Abstraction is a strategy that gives the essential points of a document in a short period of time. The video abstraction approach proposed in this research is based on multi-modal video data, which comprises both audio and visual data. Segmenting the input video into scenes and obtaining a textual and visual summary for each scene are the major video abstraction procedures used to summarize the video events into a static document. To recognize shot and scene boundaries in a video sequence, a hybrid features method was employed, which improves shot detection performance by selecting strong and flexible features. The most informative keyframes from each scene are then incorporated into the visual summary. A hybrid deep learning model was used for abstractive text summarization. The BBC archive provided the testing videos, which comprised BBC Learning English and BBC News; in addition, a news summary dataset was used to train the deep model. The textual summary was assessed using the Rouge metric, achieving a score of 40.49%, while the visual summary, evaluated using precision, recall, and F-score, achieved 94.9% accuracy, outperforming the other methods in the experiments.
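The Rouge figure quoted above is an n-gram overlap metric for comparing a generated summary against a reference. As a minimal sketch (the tokenisation and the exact ROUGE variant used in the paper are assumptions here), ROUGE-1 recall can be computed as:

```python
# Minimal ROUGE-1 recall sketch: the fraction of reference unigrams that
# also appear in the candidate summary. Real evaluations typically use a
# dedicated package with stemming and multiple ROUGE variants.

def rouge1_recall(reference, candidate):
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    overlap = sum(1 for w in ref_tokens if w in cand_tokens)
    return overlap / len(ref_tokens)

ref = "the storm closed roads across the north"
cand = "storm closed many roads"
print(round(rouge1_recall(ref, cand), 3))  # → 0.429 (3 of 7 reference words)
```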
Key frame extraction for video summarization using motion activity descriptors
Summarization of a video involves providing a gist of the entire video without affecting its semantics. This has been implemented using motion activity descriptors, which capture the relative motion between consecutive frames. Correctly capturing the motion in a video leads to the identification of its key frames. The motion is obtained using block matching techniques, which are an important part of this process. Two such techniques, Diamond Search and Three Step Search, have been studied and compared. The comparison was carried out across various videos differing in category, content, and objects. It is found that there is a trade-off between the summarization factor and precision during the summarization process.
Keywords: Video Summarization, Motion Descriptors, Block Matching
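As a rough illustration of the idea behind motion-based key frame selection (a simplified stand-in, not the paper's Diamond Search or Three Step Search block matching), frames can be kept wherever the motion activity relative to the previous frame, approximated here by the mean absolute pixel difference, exceeds a threshold:

```python
# Hypothetical sketch: key-frame selection by thresholding a crude motion
# -activity measure (mean absolute difference between consecutive frames).
# Frames are plain 2D lists of grey-level intensities.

def motion_activity(prev, curr):
    """Mean absolute pixel difference between two equally sized frames."""
    total = sum(abs(a - b)
                for row_p, row_c in zip(prev, curr)
                for a, b in zip(row_p, row_c))
    return total / (len(prev) * len(prev[0]))

def select_key_frames(frames, threshold):
    """Keep frame 0 as a reference, then every frame whose motion
    w.r.t. its predecessor exceeds the threshold."""
    keys = [0]
    for i in range(1, len(frames)):
        if motion_activity(frames[i - 1], frames[i]) > threshold:
            keys.append(i)
    return keys

# Tiny synthetic "video": three static frames, then a sudden change.
static = [[10, 10], [10, 10]]
moved = [[10, 90], [90, 10]]
video = [static, static, static, moved, moved]
print(select_key_frames(video, threshold=5.0))  # → [0, 3]
```

A real implementation would operate on decoded frames (e.g. via OpenCV) and use block-level motion vectors rather than whole-frame differences.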
Content Based Video Retrieval Using Integrated Feature Extraction and Persona...
Traditional video retrieval methods fail to meet the technical challenges posed by the large and rapid growth of multimedia data, demanding effective retrieval systems. In the last decade, Content Based Video Retrieval (CBVR) has become more and more popular. The amount of lecture video data on the World Wide Web (WWW) is growing rapidly; therefore, a more efficient method for video retrieval on the WWW or within large lecture video archives is urgently needed. This paper presents an implementation of automated video indexing and video search in a large video database. First, we apply automatic video segmentation and key-frame detection to extract frames from the video. Next, we extract textual keywords by applying Optical Character Recognition (OCR) technology to the key-frames and Automatic Speech Recognition (ASR) to the audio track of the video. We also extract colour, texture, and edge-detector features using different methods. Finally, we integrate all the keywords and features extracted by the above techniques for searching: a similarity measure is applied to retrieve the best matching videos, which are presented as output from the database. Additionally, we provide re-ranking of the results according to the user's interest.
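The keyword-matching stage of such a pipeline can be sketched as a simple inverted index over the OCR/ASR keywords. Everything below (the corpus, the overlap-count scoring, the function names) is an illustrative assumption, not the paper's implementation:

```python
# Hypothetical sketch: an inverted keyword index over per-video keyword
# sets, queried with a simple overlap-count similarity measure.

def build_index(video_keywords):
    """Map each keyword to the set of videos containing it."""
    index = {}
    for video, words in video_keywords.items():
        for w in words:
            index.setdefault(w, set()).add(video)
    return index

def search(index, video_keywords, query):
    """Rank videos by how many query keywords they contain."""
    terms = query.lower().split()
    scores = {}
    for t in terms:
        for video in index.get(t, ()):
            scores[video] = scores.get(video, 0) + 1
    return sorted(scores, key=lambda v: (-scores[v], v))

videos = {
    "lec1": {"fourier", "transform", "signals"},
    "lec2": {"neural", "networks", "training"},
    "lec3": {"fourier", "series", "signals"},
}
idx = build_index(videos)
print(search(idx, videos, "fourier signals"))  # → ['lec1', 'lec3']
```

A full system would combine these keyword scores with the visual (colour, texture, edge) feature similarities before ranking.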
Key Frame Extraction in Video Stream using Two-Stage Method with Colour and Structure
Key frame extraction, the summarization of videos for applications such as video object recognition and classification, video retrieval and archival, and surveillance, is an active research area in computer vision. This paper describes a new criterion for well-representative key frames and, correspondingly, creates a key frame selection algorithm based on a two-stage method. The two-stage method is used to extract accurate key frames that cover the content of the whole video sequence. First, an alternative sequence is obtained based on the colour characteristic difference between adjacent frames of the original sequence. Second, by analyzing the structural characteristic difference between adjacent frames of the alternative sequence, the final key frame sequence is obtained. An optimization step is then added, based on the number of final key frames, to ensure the effectiveness of key frame extraction. Khaing Thazin Min | Wit Yee Swe | Yi Yi Aung | Khin Chan Myae Zin, "Key Frame Extraction in Video Stream using Two-Stage Method with Colour and Structure", International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd27971.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-processing/27971/key-frame-extraction-in-video-stream-using-two-stage-method-with-colour-and-structure/khaing-thazin-min
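The two stages can be sketched as follows; the thresholds and the edge-count structural measure are illustrative assumptions, not the paper's exact criteria:

```python
# Hypothetical two-stage sketch: stage 1 filters frames by a colour
# (histogram) difference; stage 2 refines the survivors by a structural
# difference, proxied here by a crude edge count.

def histogram(frame, bins=4, max_val=256):
    hist = [0] * bins
    for row in frame:
        for px in row:
            hist[px * bins // max_val] += 1
    return hist

def hist_diff(f1, f2):
    return sum(abs(a - b) for a, b in zip(histogram(f1), histogram(f2)))

def edge_count(frame):
    """Structural proxy: number of strong horizontal intensity steps."""
    return sum(1 for row in frame
               for a, b in zip(row, row[1:]) if abs(a - b) > 40)

def two_stage_key_frames(frames, colour_thr=2, struct_thr=1):
    # Stage 1: colour-based alternative sequence.
    candidates = [0] + [i for i in range(1, len(frames))
                        if hist_diff(frames[i - 1], frames[i]) > colour_thr]
    # Stage 2: keep candidates whose structure also changed.
    keys = [candidates[0]]
    for prev, curr in zip(candidates, candidates[1:]):
        if abs(edge_count(frames[curr]) - edge_count(frames[prev])) >= struct_thr:
            keys.append(curr)
    return keys

flat = [[10, 10], [10, 10]]
bright = [[10, 200], [10, 200]]  # colour and structure both change
print(two_stage_key_frames([flat, flat, bright, bright]))  # → [0, 2]
```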
Video content analysis and retrieval system using video storytelling and indexing techniques
Videos are often used for communicating ideas, concepts, experiences, and situations, because of the significant advances made in video communication technology, and social media platforms have expanded video usage expeditiously. At present, a video is recognized using metadata such as its title, description, and thumbnails. There are situations where a searcher requires only a video clip on a specific topic from a long video. This paper proposes a novel methodology for the analysis of video content, using video storytelling and indexing techniques for the retrieval of the intended video clip from a long-duration video. The video storytelling technique is used for video content analysis and to produce a description of the video. The video description thus created is used to prepare an index using the wormhole algorithm, guaranteeing the search of a keyword of definite length L within the minimum worst-case time. This video index can be used by a video searching algorithm to retrieve the relevant part of the video by virtue of the frequency of the word in the keyword search of the video index. Instead of downloading and transferring a whole video, the user can download or transfer only the specifically needed video clip. The network constraints associated with the transfer of videos are thereby considerably addressed.
Decision Making Analysis of Video Streaming Algorithm for Private Cloud Computing
This study tackles the issue of how to effectively deliver video streaming content over cloud computing infrastructures. The quality of service of video streaming is strongly influenced by bandwidth, jitter, and data loss, and a number of intelligent video streaming algorithms using different techniques have been proposed to deal with these issues. This study proposes and demonstrates a novel decision making analysis which combines ISO 9126 (the international standard for software engineering) with the Analytic Hierarchy Process to help experts select the best video streaming algorithm for a private cloud computing infrastructure. The given case study concluded that the Scalable Streaming algorithm is the best algorithm to implement for delivering a high quality of service of video streaming over the private cloud computing infrastructure.
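The Analytic Hierarchy Process step can be sketched as deriving priority weights from a pairwise-comparison matrix; the judgement values below are illustrative, not taken from the study:

```python
# Hypothetical AHP sketch: priority weights for candidate streaming
# algorithms, using the common column-normalisation approximation of the
# principal eigenvector of the pairwise-comparison matrix.

def ahp_weights(matrix):
    n = len(matrix)
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    normalised = [[matrix[r][c] / col_sums[c] for c in range(n)]
                  for r in range(n)]
    # Each row's average over the normalised columns is its priority weight.
    return [sum(row) / n for row in normalised]

# Three candidate algorithms compared pairwise on a Saaty-style 1-9 scale.
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
]
weights = ahp_weights(pairwise)
print([round(w, 3) for w in weights])  # highest weight wins the selection
```

In the study's setting, the criteria weighted this way would come from the ISO 9126 quality characteristics.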
Video Key-Frame Extraction using Unsupervised Clustering and Mutual Comparison
Key-frame extraction is one of the important steps in semantic-concept-based video indexing and retrieval, and the accuracy of video concept detection highly depends on the effectiveness of the key-frame extraction method. Therefore, extracting key-frames efficiently and effectively from video shots is considered a very challenging research problem in video retrieval systems. One of many approaches to extract key-frames from a shot is to make use of unsupervised clustering: depending on the salient content of the shot and the results of clustering, key-frames can be extracted. Usually, however, because of the visual complexity and/or the content of the video shot, we tend to get near-duplicate or repetitive key-frames with the same semantic content in the output, and hence the accuracy of key-frame extraction decreases. In an attempt to improve accuracy, we propose a novel key-frame extraction method based on unsupervised clustering and mutual comparison, in which we assign a 70% weighting to the colour component (HSV histogram) and 30% to texture (GLCM) while computing a combined frame-similarity index used for clustering. We suggest a mutual comparison of the key-frames extracted from the output of the clustering, where each key-frame is compared with every other to remove near-duplicate key-frames. The proposed algorithm is computationally simple and able to detect non-redundant and unique key-frames for the shot, as a result improving the concept detection rate. The efficiency and effectiveness are validated on open-database videos.
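The 70/30 combined similarity index and the mutual-comparison step can be sketched as below; real features would be an HSV histogram and GLCM statistics, so the short "colour" and "texture" vectors here are stand-ins:

```python
# Hypothetical sketch of the weighted frame-similarity index (70% colour,
# 30% texture) and the mutual comparison that drops near-duplicate key frames.

def similarity(f1, f2, w_colour=0.7, w_texture=0.3):
    def sim(u, v):  # normalised closeness in (0, 1]
        d = sum(abs(a - b) for a, b in zip(u, v)) / len(u)
        return 1.0 / (1.0 + d)
    return (w_colour * sim(f1["colour"], f2["colour"])
            + w_texture * sim(f1["texture"], f2["texture"]))

def remove_near_duplicates(key_frames, sim_thr=0.9):
    """Mutual comparison: drop any key frame too similar to one kept earlier."""
    kept = []
    for kf in key_frames:
        if all(similarity(kf, other) < sim_thr for other in kept):
            kept.append(kf)
    return kept

frames = [
    {"colour": [1, 0, 0, 0], "texture": [5, 5]},  # key frame A
    {"colour": [1, 0, 0, 0], "texture": [5, 5]},  # near duplicate of A
    {"colour": [0, 0, 0, 9], "texture": [1, 2]},  # distinct key frame
]
print(len(remove_near_duplicates(frames)))  # → 2
```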
Video feature extraction based on modified LLE using adaptive nearest neighbor approach
Locally linear embedding (LLE) is an unsupervised learning algorithm which computes low-dimensional, neighborhood-preserving embeddings of high-dimensional data. LLE attempts to discover non-linear structure in high-dimensional data by exploiting the local symmetries of linear reconstructions. In this paper, video feature extraction is done using modified LLE along with an adaptive nearest neighbor approach to find the nearest neighbors and the connected components. The proposed feature extraction method is applied to a video; the resulting video feature description gives a new tool for the analysis of video.
A Deterministic Eviction Model for Removing Redundancies in Video Corpus
Traditional storage approaches are being challenged by huge data volumes. In multimedia content, every file does not necessarily get tagged as an exact duplicate; rather, files are prone to editing, resulting in similar copies of the same file. This paper proposes a similarity-based deduplication approach to evict similar duplicates from archive storage, which compares samples of binary hashes to identify the duplicates. The eviction is done by initially dividing the query video into dynamic key frames based on the video length. Binary hash codes of these frames are then compared with existing key frames to identify the differences, and a similarity score determined from these differences decides the eradication strategy for the duplicate copy. Duplicate elimination goes through two levels: removal of exact duplicates and removal of similar duplicates. The proposed approach shortens the comparison window by comparing only the candidate hash codes based on the dynamic key frames, and aims at accurate, lossless duplicate removal. The presented work is executed and tested on a synthetic video dataset. Results show a reduction in redundant data and an increase in available storage space. Binary hashes and similarity scores contributed to achieving a good deduplication ratio and overall performance.
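The hash-comparison core of such a scheme can be sketched with Hamming-distance similarity over binary hash codes; the hash length and the two thresholds below are illustrative assumptions:

```python
# Hypothetical sketch: comparing binary hash codes of key frames with a
# Hamming-distance similarity score, then classifying a pair as an exact
# duplicate, a similar duplicate, or distinct.

def hamming(h1, h2):
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

def similarity_score(h1, h2):
    return 1.0 - hamming(h1, h2) / len(h1)

def classify(h1, h2, exact_thr=1.0, similar_thr=0.85):
    s = similarity_score(h1, h2)
    if s >= exact_thr:
        return "exact duplicate"
    if s >= similar_thr:
        return "similar duplicate"
    return "distinct"

a = "1010110011010010"
b = "1010110011010010"  # identical copy
c = "1010110011010110"  # lightly edited copy (1 bit differs)
d = "0101001100101101"  # unrelated video
print(classify(a, b), classify(a, c), classify(a, d))
# → exact duplicate similar duplicate distinct
```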
M.Tech Second Progress Presentation on Video Summarization
1. Second Progress Presentation on
“Video Summarization”
Presented By:
Neeraj Baghel
M. Tech.(P) (CSE) II Yr
178150005
Supervised By:
Prof. Charul Bhatnagar
Professor, Dept. of CEA
GLA University, Mathura
Dept. of Computer Engineering & Applications,
GLA University, Mathura.
October 24, 2018
3. Introduction to Video Summarization
Video
• Video data is a great asset for information extraction and knowledge discovery.
• Due to its size and variability, it is extremely hard for users to monitor. [5]
Video Summarization
• Intelligent video summarization algorithms allow us to quickly browse a lengthy video by capturing the essence and removing redundant information. [5]
Fig 1: Video Summarization Work Flow [1]
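The essence-capturing step described above is often bootstrapped with simple visual-change detection. Below is a minimal, illustrative sketch (not the method of any surveyed paper): a frame is kept as a key frame whenever its color histogram differs enough from the last kept key frame. The function names, the 8-bin quantization, and the L1 threshold are assumptions made for the example.

```python
def color_histogram(frame, bins=8):
    """Quantize a frame (a list of (r, g, b) pixels) into a normalized color histogram."""
    hist = [0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in frame:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = float(len(frame)) or 1.0
    return [h / total for h in hist]

def select_keyframes(frames, threshold=0.5):
    """Keep a frame whenever its histogram differs (L1 distance) from the last keyframe."""
    keyframes = []
    last_hist = None
    for i, frame in enumerate(frames):
        hist = color_histogram(frame)
        if last_hist is None or sum(abs(a - b) for a, b in zip(hist, last_hist)) > threshold:
            keyframes.append(i)
            last_hist = hist
    return keyframes
```

For example, two identical dark frames followed by a bright frame yield key frames at indices 0 and 2; the redundant middle frame is dropped.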
4. Types of Video Summarization
Video can be summarized in two different ways, as follows.
Fig 2: Video Summarization Technique Classification [7]
6. Paper 1: TVSum: Summarizing Web Videos Using Titles [2]
Video summarization is a challenging problem in part because knowing which part of a video is important requires prior knowledge about its main topic. We present TVSum, an unsupervised video summarization framework that uses title-based image search results to find visually important shots. [2]
Authors: Yale Song, Jordi Vallmitjana, Amanda Stent, Alejandro Jaimes (Yahoo Labs, New York)
IEEE Conference on Computer Vision and Pattern Recognition, 2015
Fig. 1: An illustration of title-based video summarization. [2]
7.
Objective: To find which part of a video is important, and thus "summary worthy," which requires prior knowledge about its main topic.
Proposed Method: 1) TVSum, an unsupervised video summarization framework that uses the video title to find visually important shots. 2) A co-archetypal analysis technique that learns canonical visual concepts shared between the video and images.
Dataset: 1) TVSum50 dataset; 2) SumMe dataset.
Strength: TVSum is an unsupervised video summarization framework that uses the video title to find visually important shots.
Limitation: 1) Titles are free-formed, unconstrained, and often written ambiguously. 2) Learning from all title text is difficult.
8. Paper 2: Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach [5]
One of the main obstacles to research on video summarization is user subjectivity: users have various preferences over the summaries. This subjectiveness causes at least two problems. First, no single video summarizer fits all users unless it interacts with and adapts to the individual users. Second, it is very challenging to evaluate the performance of a video summarizer. [5]
Authors: Aidean Sharghi (A), Jacob S. Laurel (B), and Boqing Gong (A)
(A) University of Central Florida, Orlando; (B) University of Alabama at Birmingham
IEEE Conference on Computer Vision and Pattern Recognition, 2017
Fig. 2: Comparing the semantic information captured by captions and by the concept tags we collected. [8]
9.
Objective: The main obstacle to research on video summarization is user subjectivity: users have various preferences over the summaries.
Proposed Method: A memory-network-parameterized sequential determinantal point process that attends the user query onto different video frames and shots.
Dataset: 1) UT Egocentric (UTE) dataset.
Strength: 1) Introduces user preferences in the form of text queries. 2) Collects dense per-video-shot concept annotations.
Limitation: 1) Collecting dense per-video-shot concept annotations is costly.
10. Paper 3: Query-Conditioned Three-Player Adversarial Network for Video Summarization [9]
Video summarization plays an important role in video understanding by selecting key frames/shots. Traditionally, it aims to find the most representative and diverse contents in a video as short summaries. In this paper, the authors propose a query-conditioned three-player generative adversarial network to tackle this challenge. The generator learns the joint representation of the user query and the video content, and the discriminator takes three pairs of query-conditioned summaries as the input to discriminate the real summary from a generated and a random one. [9]
Authors: Yujia Zhang (1,2), Michael Kampffmeyer (3), Xiaodan Liang (4), Min Tan (1,2), Eric P. Xing (4)
(1) Institute of Automation, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences; (3) UiT The Arctic University of Norway; (4) Carnegie Mellon University
IEEE Conference on Computer Vision and Pattern Recognition, 2018
Fig. 3: Different video summarization.
11.
Objective: To find the most representative and diverse contents in a video as short summaries.
Proposed Method: A query-conditioned three-player generative adversarial network; the generator learns the joint representation of the user query and the video content.
Dataset: 1) UT Egocentric (UTE) dataset.
Strength: Results are more accurate with respect to the user query.
Limitation: 1) Does not consider randomly generated summaries.
12. Paper 4: Hierarchical Structure-Adaptive RNN for Video Summarization [10]
Video data follow a hierarchical structure: a video is composed of shots, and a shot is composed of several frames. However, few existing summarization approaches pay attention to the shot segmentation procedure. They generate shots by trivial strategies, such as fixed-length segmentation, which may destroy the underlying hierarchical structure of video data and further reduce the quality of generated summaries. [10]
Authors: Bin Zhao (1), Xuelong Li (2), Xiaoqiang Lu (2)
(1) Northwestern Polytechnical University, Shaanxi, P. R. China; (2) Chinese Academy of Sciences, Shaanxi, P. R. China
IEEE Conference on Computer Vision and Pattern Recognition, 2018
Fig. 4: The diagram of the proposed HSA-RNN, where Layer 1 and Layer 2 are designed to exploit the video structure and generate the video summary. [10]
13.
Objective: To exploit the underlying hierarchical structure of video data and further improve the quality of generated summaries.
Proposed Method: A structure-adaptive video summarization approach that integrates shot segmentation and video summarization into a Hierarchical Structure-Adaptive RNN.
Dataset: 1) SumMe dataset; 2) TVSum dataset.
Strength: 1) Uses the hierarchical structure of video data to improve the quality of generated summaries.
Limitation: 1) Results are not based on user subjectivity.
14. Paper 5: Unsupervised Object-Level Video Summarization with Online Motion Auto-Encoder [11]
Unsupervised video summarization plays an important role in digesting, browsing, and searching the ever-growing videos every day, yet the underlying fine-grained semantic and motion information (i.e., objects of interest and their key motions) in online videos has been barely touched. [11]
Authors: Yujia Zhang (A), Xiaodan Liang (B), Dingwen Zhang (C), Min Tan (A), Eric P. Xing (B)
(A) University of Chinese Academy of Sciences, Beijing, China; (B) Carnegie Mellon University, Pittsburgh, PA, USA; (C) Xidian University, Xi'an, China
IEEE Conference on Computer Vision and Pattern Recognition, 2018
Fig. 5: Different types of video summarization techniques. [11]
15.
Objective: To extract key motions of participating objects and learn to summarize in an unsupervised and online manner.
Proposed Method: A novel online motion Auto-Encoder (online motion-AE) framework that operates on super-segmented object motion clips.
Dataset: 1) OrangeVille; 2) Base jumping dataset from the public CoSum dataset.
Strength: 1) Video is summarized based on moving object instances. 2) Each moving object is tracked.
Limitation: 1) Tracking many fast-moving objects is very complex. 2) Results are not based on user subjectivity.
16. Research Gap
• Title-based video summarization, where titles are free-formed and often written ambiguously, requires unsupervised learning of the title text.
• Collecting dense per-video-shot annotations using learning algorithms.
• HSA-RNN video summarization based on user subjectivity.
• Unsupervised object-level video summarization with an online motion auto-encoder that incorporates user subjectivity.
• Finding key frames based on extracted text and assigning a weight to each frame.
17. Challenges
Some challenges related to video summarization:
• Learning all title text.
• Accuracy of object-learning algorithms.
• Assigning weights to extracted text.
• Recovering lost information.
• Computational expense.
• Evaluating the performance of a video summarizer.
• No single video summarizer fits all users.
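Evaluating a video summarizer, one of the challenges above, is commonly done by comparing the machine summary against user summaries with an F-measure over the selected frames. The sketch below is a generic version of that protocol; the exact matching rules vary per dataset, and the function name is illustrative.

```python
def summary_f_score(predicted, ground_truth):
    """F-measure between two summaries given as collections of selected frame indices."""
    pred, gt = set(predicted), set(ground_truth)
    overlap = len(pred & gt)
    if not pred or not gt or overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # fraction of selected frames that match the user's
    recall = overlap / len(gt)       # fraction of the user's frames that were recovered
    return 2 * precision * recall / (precision + recall)
```

With multiple annotators (as in SumMe), this score is typically averaged or maximized over the per-user summaries.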
19. Datasets
UT Egocentric (UTE) [5]
The dataset contains 4 videos from head-mounted cameras, each about 3-5 hours long. (Size: 1.4 GB)
SumMe [12]
The dataset consists of 25 single-shot videos ranging in length from 1-6 minutes. It contains summaries created by 15 to 18 users, with the length constraint that each summary should be 5% to 15% of the original video. (Size: 2.2 GB)
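The SumMe length constraint (summaries of at most roughly 15% of the original video) turns summary generation into a budgeted selection problem. Below is a hedged sketch of a simple greedy selector, assuming per-shot importance scores are already available from some model (the scoring model itself is not shown, and the function name is illustrative):

```python
def pick_shots(scores, lengths, budget_fraction=0.15):
    """Greedily pick the highest-scoring shots until the length budget is used up."""
    total = sum(lengths)
    budget = budget_fraction * total
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0.0
    for i in order:
        if used + lengths[i] <= budget:  # skip shots that would exceed the budget
            chosen.append(i)
            used += lengths[i]
    return sorted(chosen)  # return shot indices in temporal order
```

In the literature this step is often solved more exactly as a 0/1 knapsack; greedy selection is just the simplest baseline.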
20. Datasets Cont…
YouTube-8M [2]
YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities.
• Each video must be public and have at least 1000 views.
• Each video must be between 120 and 500 seconds long.
• Each video must be associated with at least one entity from the target vocabulary.
• Adult & sensitive content is removed (as determined by automated classifiers).
May 2018 version (current): 6.1M videos, 3862 classes, 3.0 labels/video, 2.6B audio-visual features.
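The inclusion rules above can be restated as a simple filter. This is only an illustrative restatement of the published criteria, not code from the YouTube-8M project, and the function name and parameters are assumptions:

```python
def eligible_for_youtube8m(duration_s, views, entities, is_sensitive):
    """Check a video against a simplified version of the YouTube-8M inclusion rules."""
    return (views >= 1000                  # at least 1000 public views
            and 120 <= duration_s <= 500   # between 120 and 500 seconds long
            and len(entities) >= 1         # at least one target-vocabulary entity
            and not is_sensitive)          # adult & sensitive content removed
```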
21. Tools
Matlab
Matlab is a commercial product that is widely used in the image and video processing community. It has an adequate image processing toolbox, as well as toolboxes for Kalman filters, neural networks, genetic algorithms, and so on. It runs on most Unix systems, including Linux, and on Windows 95/NT. For researchers developing vision algorithms, the lack of source code is a serious drawback.
OpenCV
OpenCV is a library of programming functions mainly aimed at real-time computer vision, originally developed by Intel. The library is cross-platform and free for use under the open-source BSD license.
22. Tools Cont…
Python
Python is an interpreted, high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace.
23. Conclusion
Text retrieval can be used to assign a weight to each frame, and that weight can serve as an additional feature for generating the video summary.
24. References
1. https://www.slideshare.net/MikolajLeszczuk/results-on-video-summarization (D.L.V 01/09/18)
2. Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web videos using titles," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 5179-5187, 2015.
3. Y. Zhuang, R. Xiao, and F. Wu, "Key issues in video summarization and its application," in Proc. 2003 Joint Conf. of the 4th Int. Conf. on Information, Communications and Signal Processing and the 4th Pacific Rim Conf. on Multimedia, vol. 1, pp. 448-452, IEEE, 2003.
4. R. Kansagara, D. Thakore, and M. Joshi, "A study on video summarization techniques," International Journal of Innovative Research in Computer and Communication Engineering, vol. 2, 2014.
5. A. Sharghi, J. S. Laurel, and B. Gong, "Query-focused video summarization: Dataset, evaluation, and a memory network based approach," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2127-2136, 2017.
25. References Cont…
6. P. Mundur, Y. Rao, and Y. Yesha, "Keyframe-based video summarization using Delaunay clustering," International Journal on Digital Libraries, vol. 6, no. 2, pp. 219-232, 2006.
7. M. Padmavathi, Y. Rao, and Y. Yesha, "Keyframe-based video summarization using Delaunay clustering," International Journal on Digital Libraries, vol. 6, no. 2, pp. 219-232, 2006.
8. S. Yeung, A. Fathi, and L. Fei-Fei, "VideoSET: Video summary evaluation through text," arXiv preprint arXiv:1406.5824, 2014.
9. Y. Zhang, M. Kampffmeyer, X. Liang, M. Tan, and E. P. Xing, "Query-conditioned three-player adversarial network for video summarization," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
10. B. Zhao, X. Li, and X. Lu, "HSA-RNN: Hierarchical structure-adaptive RNN for video summarization," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 7405-7414, 2018.
11. Y. Zhang, X. Liang, D. Zhang, M. Tan, and E. P. Xing, "Unsupervised object-level video summarization with online motion auto-encoder," arXiv preprint arXiv:1801.00543, 2018.
12. M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, "Creating summaries from user videos," in Proc. European Conference on Computer Vision (ECCV), pp. 505-520, Springer, Cham, 2014.