In [1], each capsule uses an activity vector to represent different instantiation parameters (position, size, orientation, thickness, etc.), with the vector length (norm) representing the probability that an entity is present.
Hence, the output vector of each capsule needs to be normalized to the range [0, 1).
This is done by a non-linear squashing function.
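As a minimal sketch (assuming the standard capsule-network squashing function of Sabour et al., implemented here in NumPy for illustration), the normalization can be written as:

```python
import numpy as np

def squash(s, eps=1e-8):
    # Non-linear squashing: shrinks the norm of s into [0, 1)
    # while preserving its direction, so the vector length can
    # act as a presence probability.
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)          # in [0, 1)
    return scale * s / np.sqrt(sq_norm + eps)  # rescaled unit vector

v = squash(np.array([3.0, 4.0]))   # input norm is 5
print(np.linalg.norm(v))           # 25/26 ≈ 0.9615, direction unchanged
```

Short vectors are squashed toward zero length and long vectors toward (but never reaching) unit length, which is what lets the norm be read as a probability.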
2. This paper proposes a Discriminative Latent Semantic Graph
(D-LSG) framework to generate natural language captions
that can summarize the visual contents in long videos. The
model has three main components:
• A conditional graph enhances object proposals by fusing contextual information from the video frames.
• A dynamic graph aggregates the enhanced proposals into compact visual words with higher semantic meaning.
• A discriminative module validates the generated captions by reconstructing visual words and scoring them against the original visual words to ensure fidelity and relevance.
The model can effectively leverage complex object interactions, extract salient visual concepts from videos, and generate content-relevant captions.
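The dynamic-graph aggregation step above can be loosely sketched as attention-style pooling of proposal features into a small set of visual words; the actual D-LSG mechanism is more elaborate, and all names, shapes, and the random features below are illustrative assumptions:

```python
import numpy as np

def aggregate_visual_words(proposals, queries):
    # proposals: (N, d) enhanced object-proposal features
    # queries:   (K, d) slot vectors (learnable in a real model;
    #            random here purely for illustration)
    d = proposals.shape[1]
    scores = queries @ proposals.T / np.sqrt(d)              # (K, N) affinities
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                  # softmax over proposals
    return attn @ proposals                                  # (K, d) visual words

rng = np.random.default_rng(0)
words = aggregate_visual_words(rng.standard_normal((10, 16)),
                               rng.standard_normal((4, 16)))
print(words.shape)  # (4, 16): 10 proposals condensed to 4 visual words
```

Each output row is a convex combination of proposal features, which is the sense in which many proposals are condensed into fewer, more semantic units.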
3. Video captioning aims to use natural language descriptions
to summarize the visual contents in video data. This is a
challenging task as it requires:
• Modeling complex dependencies between objects and
their interactions
• Extracting high-level visual concepts from
spatio-temporal video data
• Generating captions that accurately reflect visual
content and are semantically rich
4. Literature survey. Columns: Sr. No. | Title | Year | Authors | Methodology | Feature Extraction Technique | Classifier | Accuracy | Issues | Research Gap.
1. Discriminative Latent Semantic Graph for Video Captioning (2021)
• Authors: Yang Bai, Junyan Wang, Yang Long, Bingzhang Hu, Yang Song, Maurice Pagnucco, Yu Guan
• Methodology: Sequence-to-Sequence Model, Deep Learning
• Feature Extraction Technique: 2D CNNs, Faster R-CNN, LSTM models
• Classifier: Language LSTM, multimodal bilinear pooling
• Accuracy: 70%
2. Video Joint Modelling Based on Hierarchical Transformer for Co-summarization (2022)
• Authors: Haopeng Li, Qiuhong Ke, Mingming Gong, Rui Zhang
• Methodology: ML
• Feature Extraction Technique: GoogLeNet (CNN); Video Joint Modelling based on Hierarchical Transformer (VJMHT)
• Classifier: Transformer-based models (F-Transformer and S-Transformer) for modeling intra-shot and inter-shot dependencies in the video summarization process
• Accuracy: 80%
• Issues: Low F-measure / rank correlation; long training and inference times
• Research Gap: Generalization across diverse datasets; handling long videos and temporal context
3. Video summarization using deep learning techniques: a detailed analysis and investigation (2023)
• Authors: Parul Saini, Krishan Kumar, Shamal Kashid, Ashray Saini, Alok Negi
• Methodology: Deep Learning
• Feature Extraction Technique: 3D CNN, FCNN, DG-CNN
• Classifier: K-Nearest Neighbors (K-NN), Deep Belief Network (DBN)
• Accuracy: 88%
• Issues: Some GAN-based models may produce very short summaries that lack important details, making it challenging to strike the right balance
• Research Gap: Additional effort should go into video summarization algorithms that optimize summaries for the intended audience
4. Semantic Text Summarization of Long Videos (2017)
• Authors: Shagan Sah, Sourabh Kulhare, Allison Gray, Subhashini Venugopalan, Emily Prud'hommeaux, Raymond Ptucha
• Methodology: Deep Learning, Neural Network
• Feature Extraction Technique: 3D CNN
• Classifier: Deep visual-captioning techniques for feature extraction in video summarization
• Accuracy: 70%
• Issues: Annotated ground-truth data for semantic video summarization may be limited or expensive to obtain, hindering supervised training of CNN-based models
• Research Gap: Enhancing CNNs' semantic understanding of video content is essential; explore techniques that allow CNNs to recognize actions and relationships within video frames
5. Towards Diverse Paragraph Captioning for Untrimmed Videos (2021)
• Authors: Yuqing Song, Shizhe Chen, Qin Jin (Renmin University of China; INRIA)
• Methodology: Machine Learning, Reinforcement Learning
• Feature Extraction Technique: ResNet, VGG
• Classifier: MFT, VTransformer, AdvInf, MART
• Accuracy: 79%
• Issues: The vanilla encoder brings a computation burden for long paragraph generation; both MLE and RL training make the model generate high-frequency words and phrases
• Research Gap: Scalability to longer videos; training with limited data; evaluation metrics beyond state of the art
6. A Comprehensive Review of the Video-to-Text Problem (2021)
• Authors: Jesus Perez-Martin, Benjamin Bustos, Silvio Jamil F. Guimarães, Ivan Sipiran, Jorge Pérez, Grethel Coello
• Methodology: ML, DL
• Feature Extraction Technique: 2D CNN, NLG (Natural Language Generator), RNN
• Classifier: AlexNet, ImageNet, ILSVRC
• Accuracy: 71%
• Issues: An essential issue for exact and precise video description generation is the selection of the most informative frames
• Research Gap: Model adaptation (fine-tuning pre-trained AlexNet on ImageNet may not always lead to optimal results for specific tasks); content variability; comparative analysis; limited mention of multimodal integration: researchers can choose or create datasets that inherently require integrating both visual and textual information
7. Real Time Video to Text Summarization using Neural Network (2020)
• Authors: Abhishek Yadav, Anjali Vishwakarma, Shyama Panickar, Prof. Satish Kuchiwale
• Methodology: Deep Learning
• Feature Extraction Technique: Convolutional Neural Network
• Classifier: RNN, SoftMax layer
• Accuracy: 75%
• Issues: Training RNNs for video summarization can suffer from the vanishing gradient problem, where gradients become too small; this can impact training stability and convergence
• Research Gap: Research should aim to develop effective regularization techniques and architectural innovations to mitigate overfitting in RNN-based video summarization models
8. Video Summarization by Learning Deep Side Semantic Embedding (2019)
• Authors: Yitian Yuan, Tao Mei (Senior Member, IEEE), Peng Cui, Wenwu Zhu
• Methodology: Deep Learning
• Feature Extraction Technique: 3D-CNN
• Classifier: DSSE model
• Accuracy: 80%
• Issues: Effectively measuring the semantic relevance between video frames and query information
• Research Gap: The Deep Side Semantic Embedding (DSSE) model addresses these issues by leveraging side information to select semantically meaningful segments from videos
9. Spatiotemporal Modeling for Video Summarization Using Convolutional Recurrent Neural Network (2019)
• Authors: Yuan Yuan, Haopeng Li, Qi Wang
• Methodology: Deep Learning
• Feature Extraction Technique: 2D-CNN, DCNNs
• Classifier: AlexNet and GoogLeNet
• Accuracy: 85%
• Issues: The increasing amount of video data; the difficulty of retrieving valuable information conveyed by videos; the extremely heavy burden of data storage
• Research Gap: Improving computational efficiency; further research into enabling real-time operation, especially for applications that require rapid summarization
10. Text Semantics Based Automatic Summarization for Chinese Videos (2015)
• Authors: Wang Xingqi, Zha Taotao, Wu Chunming, Fang Jinglong
• Methodology: ML
• Feature Extraction Technique: HLAC, HOG
• Classifier: Ant Colony
• Accuracy: –
• Issues: Broad range of content
• Research Gap: There had been no attempt at text-semantics-based video summarization prior to their proposed method
11. Video and Text Summarization Using VDAN and RNN (2021)
• Authors: Joys Princia A, Ms. J Sangeetha Priya, Kalai Selvi J, Rithi Afra J
• Methodology: Deep Learning and Neural Network
• Feature Extraction Technique: VDAN
• Classifier: Random Forest
• Accuracy: –
• Issues: Visual gaps and breaks between frames; short-term dependencies of simple RNNs
6. The key problems this model aims to address:
• Current video captioning models cannot effectively leverage complex object-level
interactions and relationships in the video data.
• They fail to extract high-level visual concepts that capture salient information from
spatio-temporal video data.
• Existing models struggle to validate the fidelity and relevance of generated captions to
the source video's visual content.
9. 1. Literature Review
• Survey prior work in video captioning and summarization
• Understand limitations of existing methods
• Identify opportunities for improvement
2. Problem Definition
• Clearly define the problem to be solved
• Set project objectives and scope
3. Data Collection
• Gather relevant video datasets for training and testing
• Ensure diversity of video content.
4. Model Development
• Implement base encoder-decoder architecture
• Incorporate conditional graph for enhancing object proposals
• Develop dynamic graph for latent proposal aggregation
5. Training and Optimization
• Prepare training data and protocols
• Train model end-to-end with suitable loss functions
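As a hedged sketch of the "suitable loss functions" step, encoder-decoder captioners are typically trained with per-token cross-entropy; the shapes and names below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def caption_xent_loss(logits, target_ids):
    # logits:     (T, V) decoder scores for T time steps over a V-word vocabulary
    # target_ids: (T,)   ground-truth word indices for the reference caption
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-likelihood of the correct word at each step
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

# With uniform (all-zero) logits the loss is exactly log(V):
loss = caption_xent_loss(np.zeros((5, 100)), np.array([3, 7, 1, 0, 42]))
print(loss)  # log(100) ≈ 4.605
```

Training end-to-end then just means backpropagating this loss through both the decoder and the visual encoder.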
10. The following are possible future scopes and applications of the introduced model:
1. Video Search and Retrieval
• Generate textual captions to index video content
• Enable text-based semantic search of video database
2. Video Highlight Detection
• Identify key moments and events in long videos
• Generate concise summaries for skimming videos
3. Law Enforcement
• Scan and index video evidence from body-worn cameras
• Surface video segments containing threats, violations etc.
• Assist investigators in reviewing large volumes of footage
4. Multi-lingual Subtitling
• The generated text can be translated to create multi-lingual
subtitles and aid localization of video content.
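The search-and-retrieval application above can be illustrated with a toy bag-of-words ranker over generated captions; all clip IDs, captions, and the query are made-up placeholders:

```python
import numpy as np
from collections import Counter

# Hypothetical index mapping video clips to their generated captions
captions = {
    "clip_001": "a man is playing guitar on stage",
    "clip_002": "a dog runs across the park",
    "clip_003": "a woman is cooking pasta in the kitchen",
}

def bow(text):
    # Bag-of-words term counts for a caption or query
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two term-count vectors
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = np.sqrt(sum(v * v for v in a.values())) * np.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def search(query, captions):
    # Rank clips by caption similarity to the text query
    q = bow(query)
    return sorted(captions, key=lambda c: cosine(q, bow(captions[c])), reverse=True)[0]

print(search("dog in a park", captions))  # clip_002
```

A production system would use proper text embeddings and an inverted index, but the principle is the same: captions turn video search into text search.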
11. [1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for
image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6077–6086.
[2] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[3] David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies. 190–200.
[4] Yangyu Chen, Shuhui Wang, Weigang Zhang, and Qingming Huang. 2018. Less is more: Picking informative frames for video captioning. In ECCV. 358–373.
[5] Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. 2017. Towards diverse and natural image descriptions via a conditional gan. In ICCV. 2970–2979.
[6] Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth
workshop on statistical machine translation. 376–380.
[7] Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2013.
Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV. 2712–2719.
[8] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. In Advances in
neural information processing systems. 5767–5777.
[9] Jingyi Hou, Xinxiao Wu, Xiaoxun Zhang, Yayun Qi, Yunde Jia, and Jiebo Luo. 2020. Joint Commonsense and Relation Reasoning for Image and Video Captioning. In AAAI. 10973–10980.
[10] Yaosi Hu, Zhenzhong Chen, Zheng-Jun Zha, and Feng Wu. 2019. Hierarchical global-local temporal modeling for video captioning. In Proceedings of the
27th ACM International Conference on Multimedia. 774–783.
[11] Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016).