Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation
Using a New Frame Selection Policy and Gating Mechanism
Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE Int. Symposium on Multimedia,
Naples, Italy, Dec. 2022
• The recognition of high-level events in unconstrained video is an important topic
with applications in security (e.g. “making a bomb”), the automotive industry (e.g.
“pedestrian crossing the street”), etc.
• Most approaches are top-down: they “patchify” the frame (context-agnostic) and use
the label and loss function to learn to focus on frame regions related to the event
• Bottom-up approaches: use an object detector, feature extractor and graph
network to extract and process features from the main objects in the video
Introduction
Video event
“walking the dog”
• Our recent bottom-up approach with SOTA performance on many datasets
• Uses a graph attention network (GAT) head to process local (object) and global
(frame) information
• Also provides frame/object-level explanations (in contrast to top-down approaches)
Video event “removing ice from car” miscategorized as “shoveling snow”
Object-level explanation: the classifier does not focus on the car object
ViGAT
• Cornerstone of the ViGAT head; transforms a feature matrix (representing the graph’s
nodes) into a feature vector (representing the whole graph)
• Computes the explanation significance (weighted in-degrees, WiDs) of each node
using the graph’s adjacency matrix
[Figure: GAT head. The attention mechanism computes the attention matrix from the
node features X (K × F), and the adjacency matrix A (K × K) from the attention
coefficients; the node features are multiplied with the adjacency matrix, giving
Z (K × F); graph pooling then produces the vector representation of the whole
graph, η (1 × F).]
WiDs, the explanation significance of the l-th node:
φ_l = Σ_{k=1}^{K} a_{k,l} ,  l = 1, …, K
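The graph pooling and WiD computation can be sketched in a few lines. This is a minimal NumPy illustration, with mean pooling over the propagated node features assumed for simplicity (the actual ViGAT head uses a trained attention module):

```python
import numpy as np

def gat_pool(X, A):
    """X: (K, F) node features; A: (K, K) adjacency matrix built from
    attention coefficients. Returns the graph vector and per-node WiDs."""
    Z = A @ X              # propagate node features through the graph
    eta = Z.mean(axis=0)   # eta (1 x F): representation of the whole graph
    wid = A.sum(axis=0)    # phi_l = sum_k a_{k,l}: weighted in-degree of node l
    return eta, wid
```

Nodes with a larger weighted in-degree receive more attention from the other nodes, and are therefore more significant for the explanation.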
ViGAT block
ViGAT architecture
[Figure: An object detector (o) extracts K objects from each of the P video frames,
and a feature extractor (b) produces object-level features as well as frame-level
global features. The local branch (GAT blocks ω2, ω3) processes the object-level
features into the video-level local feature; the global branch (GAT block ω1)
processes the frame-level global features into the video-level global feature
(mean pooling). The two features are concatenated into the video feature, which is
fed to the classification head (u). The model also outputs frame WiDs (from local
and global information) and object WiDs. Example output — Recognized event:
“playing beach volleyball”; Explanation: event-supporting frames and objects.]
o: object detector; b: feature extractor; u: classification head
GAT blocks: ω1 (global branch); ω2, ω3 (local branch)
• ViGAT has a high computational cost due to local (object) information processing
(e.g., P = 120 frames, K = 50 objects per frame, i.e. PK = 6000 objects/video)
• Efficient video processing has been investigated in the top-down (frame) paradigm:
- Frame selection policy: identify the most important frames for classification
- Gating component: stop processing frames when sufficient evidence is achieved
• Unexplored topic in the bottom-up paradigm: can we use such techniques to reduce
the computational complexity of the local processing pipeline of ViGAT?
ViGAT
Gated-ViGAT
[Figure: Gated-ViGAT architecture. The video-level global feature and the frame WiDs
(global info) are computed once from the P extracted video frames. The frame
selection policy then picks Q(s) frames, whose K objects per frame are processed by
the local information pipeline (o, b, ω2, ω3) to produce the video-level local
feature ζ(s) and the intermediate representation Z(s). Gate g(s) inspects Z(s)
(ON/OFF): if the gate is closed, Q(s+1) − Q(s) additional frames are requested; if
it is open, ζ(s) is concatenated with the video-level global feature and the
classification head u outputs the recognized event. Frame WiDs (local info) and
object WiDs provide the explanation. Example output — Recognized event: “playing
beach volleyball”; Explanation: event-supporting frames and objects.]
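The gating mechanism amounts to an early-exit loop over a growing frame budget: process Q(s) frames, ask gate g(s) whether the evidence suffices, and only request more frames if it does not. A minimal sketch follows; the components here are illustrative stand-in callables (not the real ViGAT modules), and the frame selection policy is replaced by simply taking the first q frames:

```python
import numpy as np

def gated_inference(frame_feats, gate_fns, Q, local_branch, classify):
    """Early-exit sketch: request more frames only while gates stay closed."""
    for s, q in enumerate(Q):                      # Q = [Q(1), ..., Q(S)]
        zeta = local_branch(frame_feats[:q])       # local feature from q frames
        if gate_fns[s](zeta) or s == len(Q) - 1:   # gate open, or last gate
            return classify(zeta), q               # classify; report frames used

# Dummy stand-ins for illustration only:
rng = np.random.default_rng(0)
feats = rng.normal(size=(30, 8))                   # P = 30 frame features
local_branch = lambda x: x.mean(axis=0)
gate_fns = [lambda z, t=t: float(np.linalg.norm(z)) > t for t in (1.0, 0.5, -1.0)]
classify = lambda z: int(np.argmax(z))

label, frames_used = gated_inference(feats, gate_fns, [2, 4, 6], local_branch, classify)
```

The last gate always exits, so the loop terminates after at most Q(S) frames; easy videos pay only the cost of the first budget.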
• Iterative algorithm to select Q frames
Input: Q, initial frame index p₁, the P frame-level global features u₁, …, u_P and
their frame WiDs (global info)
1. Initialize: min-max normalize the frame WiDs and select p₁ = argmax over them;
set γ_p = u_p / |u_p|
2. Select the remaining Q − 1 frames, iterating for i = 2, …, Q:
α_p = (1/2) (1 − γ_pᵀ γ_{p_{i−1}})   (down-weight frames similar to the last selection)
u_p ← α_p u_p
p_i = argmax over the updated (min-max normalized) scores
Gated-ViGAT: Frame selection policy
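Under this reading of the policy, a NumPy sketch could look as follows. The function name, its arguments, and the exact re-normalization details are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def select_frames(U, wids, Q):
    """U: (P, F) frame-level global features; wids: (P,) frame WiDs.
    Returns Q frame indices that score high yet differ from each other."""
    G = U / np.linalg.norm(U, axis=1, keepdims=True)             # gamma_p = u_p / |u_p|
    s = (wids - wids.min()) / (wids.max() - wids.min() + 1e-12)  # min-max normalize
    picked = [int(np.argmax(s))]                                 # p_1: highest WiD
    mask = np.zeros(len(s), dtype=bool)
    mask[picked[0]] = True
    for _ in range(Q - 1):
        alpha = 0.5 * (1.0 - G @ G[picked[-1]])   # penalize similarity to last pick
        s = alpha * s                             # u_p <- alpha_p u_p
        cand = np.where(mask, -np.inf, s)         # never re-select a frame
        nxt = int(np.argmax(cand))
        picked.append(nxt)
        mask[nxt] = True
    return picked
```

Because α_p ∈ [0, 1] and equals 0 for the frame just selected, the scores of redundant frames shrink across iterations, yielding a diverse set of high-WiD frames.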
• Each gate has a GAT block-like structure and a binary classification
head (open/close); it corresponds to a specified number of frames Q(s) and is
trained to output 1 (i.e. open) when the ViGAT loss is low; design
hyperparameters: Q(s), β (sensitivity)
Use the frame selection policy to select Q(s) frames for gate g(s)
Compute the video-level local feature ζ(s) (and Z(s))
Compute the ViGAT classification loss: l_ce = CE(label, y)
Derive the pseudolabel o(s): 1 if l_ce ≤ β·e^{s/2}; 0 otherwise
Compute the gate component loss: L = (1/S) Σ_{s=1}^{S} l_bce(g(s)(Z(s)), o(s))
Perform backpropagation to update the gate weights
[Figure: gate training. The local ViGAT branch produces ζ(s) and Z(s) from the Q(s)
selected frames; ζ(s) is concatenated with the computed video-level global feature
into the video feature, and the classification head u outputs y, which is compared
to the ground-truth label with cross entropy. The gate g(s) is applied to Z(s) and
trained with binary cross entropy against the pseudolabel o(s).]
Gated-ViGAT: Gate training
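The pseudolabel rule and gate loss can be written out directly. This is a sketch under stated assumptions: gate indices start at s = 1, and `gate_probs` stands for the sigmoid outputs g(s)(Z(s)):

```python
import math

def gate_pseudolabels(ce_losses, beta):
    """o(s) = 1 when the ViGAT loss with Q(s) frames is already low;
    the threshold beta * e^(s/2) relaxes for later (larger) gates."""
    return [1 if l <= beta * math.exp(s / 2) else 0
            for s, l in enumerate(ce_losses, start=1)]

def gate_loss(gate_probs, pseudolabels, eps=1e-7):
    """L = (1/S) * sum_s l_bce(g(s), o(s))"""
    S = len(gate_probs)
    return -sum(o * math.log(g + eps) + (1 - o) * math.log(1 - g + eps)
                for g, o in zip(gate_probs, pseudolabels)) / S
```

A gate thus learns to open exactly when the classifier would already succeed with its frame budget, which is what makes early exiting safe.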
• ActivityNet v1.3: 200 events/actions, 10K/5K training/testing videos, 5 to 10 mins long; multilabel
• MiniKinetics: 200 events/actions, 80K/5K training/testing videos, 10 secs duration; single-label
• Video representation: 120/30 frames with uniform sampling for ActivityNet/MiniKinetics
• Pretrained ViGAT components: Faster R-CNN (pretrained/finetuned on ImageNet1K/VG, K = 50
objects), ViT-B/16 backbone (pretrained/finetuned on ImageNet11K/ImageNet1K), 3 GAT blocks
(pretrained on the respective dataset, i.e., ActivityNet or MiniKinetics)
• Gates: S = 6/5 (number of gates), {Q(s)} = {9, 12, 16, 20, 25, 30} / {2, 4, 6, 8, 10} (sequence lengths),
for ActivityNet/MiniKinetics
• Gate training hyperparameters: β = 10⁻⁸, epochs = 40, lr = 10⁻⁴ multiplied by 0.1 at epochs 16 and 35
• Evaluation measures: mAP (ActivityNet), top-1 accuracy (MiniKinetics), FLOPs
• Gated-ViGAT is compared against top-scoring methods in the two datasets
Experiments
Methods in MiniKinetics Top-1%
TBN [30] 69.5
BAT [7] 70.6
MARS (3D ResNet) [31] 72.8
Fast-S3D (Inception) [14] 78
ATFR (X3D-S) [18] 78
ATFR (R(2+1D)) [18] 78.2
RMS (SlowOnly) [28] 78.6
ATFR (I3D) [18] 78.8
Ada3D (I3D, Kinetics) [32] 79.2
ATFR (3D Resnet) [18] 79.3
CGNL (Modified ResNet) [17] 79.5
TCPNet (ResNet, Kinetics) [3] 80.7
LgNet (R3D) [3] 80.9
FrameExit (EfficientNet) [1] 75.3
ViGAT [9] 82.1
Gated-ViGAT (proposed) 81.3
• Gated-ViGAT outperforms all top-down approaches
• Slightly underperforms ViGAT, but with approx. 5.5× and 4× FLOPs reduction on
ActivityNet and MiniKinetics, respectively
• As expected, it has higher computational complexity than many top-down
approaches (e.g. see [3], [4]) but can provide explanations
Methods in ActivityNet mAP%
AdaFrame [21] 71.5
ListenToLook [23] 72.3
LiteEval [33] 72.7
SCSampler [25] 72.9
AR-Net [13] 73.8
FrameExit [1] 77.3
AR-Net (EfficientNet) [13] 79.7
MARL (ResNet, Kinetics) [22] 82.9
FrameExit (X3D-S) [1] 87.4
ViGAT [9] 88.1
Gated-ViGAT (proposed) 87.3
FLOPs in the 2 datasets ViGAT Gated-ViGAT
ActivityNet 137.4 24.8
MiniKinetics 34.4 8.7
Experiments: results
*Best and second best performance
are denoted with bold and underline
• Computed the # of videos processed and the recognition performance for each gate
• Average number of frames for ActivityNet / MiniKinetics: 20 / 7
• The recognition rate drops as the gate number increases; this behavior is more
clearly visible in ActivityNet (longer videos)
• Conclusion: “easy” videos exit early, while “difficult” videos remain hard to
recognize even with many frames (a conclusion similar to [1])
ActivityNet g(1) g(2) g(3) g(4) g(5) g(6)
# frames 9 12 16 20 25 30
# videos 793 651 722 502 535 1722
mAP% 99.8 94.5 93.8 92.7 86 71.6
MiniKinetics g(1) g(2) g(3) g(4) g(5)
# frames 2 4 6 8 10
# videos 179 686 1199 458 2477
Top-1% 84.9 83 81.1 84.9 80.7
Experiments: method insight
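The average frame counts follow (approximately) as the exit-weighted mean over the per-gate tables above; a quick back-of-the-envelope check (the slide's rounded 20/7 figures may stem from a slightly different computation):

```python
# Per-gate exit statistics, taken from the tables above
act_frames = [9, 12, 16, 20, 25, 30]
act_videos = [793, 651, 722, 502, 535, 1722]
mini_frames = [2, 4, 6, 8, 10]
mini_videos = [179, 686, 1199, 458, 2477]

def avg_frames(frames, videos):
    # Weighted mean: each video costs the frame budget of its exit gate
    return sum(f * v for f, v in zip(frames, videos)) / sum(videos)

act_avg = avg_frames(act_frames, act_videos)     # roughly 20.6 frames
mini_avg = avg_frames(mini_frames, mini_videos)  # roughly 7.7 frames
```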
• Bullfighting (top) and Cricket (bottom) test videos of ActivityNet exit at the first
gate, i.e., they are recognized using only 9 frames out of the 120 required by ViGAT
• The frames selected with the proposed policy both explain the recognition result and
provide a diverse view of the video, helping to recognize it with fewer frames
Bullfighting
Cricket
Experiments: examples
• Gated-ViGAT can also provide explanations at the object level (in contrast to top-down methods)
“Waterskiing” predicted
as “Making a sandwich”
“Playing accordion” predicted
as “Playing guitarra”
“Breakdancing” (correct prediction)
Experiments: examples
Policy / #frames 10 20 30
Random 83 85.5 86.5
WiD-based 84.9 86.1 86.9
Random on local 85.4 86.6 86.9
WiD-based on local 86.6 87.1 87.5
FrameExit policy 86.2 87.3 87.5
Proposed policy 86.7 87.3 87.6
Gated-ViGAT (proposed) 86.8 87.5 87.7
Experiments: ablation study on frame selection policies
• Comparison (mAP%) on ActivityNet
• Gated-ViGAT selects diverse frames with high explanation potential
• The proposed policy is second best (surpassing FrameExit [1], the current SOTA)
Random: Θ frames selected randomly for the local/global features
WiD-based: Θ frames selected using the global WiDs
Random on local: P frames derive the global feature; Θ frames selected randomly
WiD-based on local: P frames derive the global feature; Θ frames selected using the global WiDs
FrameExit policy: Θ frames selected using the policy in [1]
Proposed policy: P frames derive the global feature; Θ frames selected using the proposed policy
Gated-ViGAT: in addition to the above, the gate component selects Θ frames on average
• Top-6 frames of “bungee jumping” video selected with WiD-based vs proposed policy
[Figure: frames selected with the WiD-based policy vs. the proposed policy (which
updates the WiDs at each iteration).]
Experiments: ablation study example
• An efficient bottom-up event recognition and explanation approach was presented
• It utilizes a new policy algorithm to select frames that: a) best explain the
classifier’s decision, and b) provide diverse information about the underlying event
• It utilizes a gating mechanism that instructs the model to stop extracting bottom-
up (object) information when sufficient evidence of the event is achieved
• Evaluation on 2 datasets showed competitive recognition performance and an
approx. 5× FLOPs reduction in comparison to the previous SOTA
• Future work: further efficiency improvements, e.g. a faster object detector and
feature extractor, frame selection also for the global information pipeline, etc.
Conclusions
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at:
https://github.com/bmezaris/Gated-ViGAT
This work was supported by the EU’s Horizon 2020 research and innovation programme under grant
agreement 101021866 (CRiTERIA)

Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
Citrus Greening Disease and its Management
Citrus Greening Disease and its ManagementCitrus Greening Disease and its Management
Citrus Greening Disease and its Management
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 

Gated-ViGAT

  • 1. Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism. Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris, CERTH-ITI, Thermi - Thessaloniki, Greece. IEEE Int. Symposium on Multimedia, Naples, Italy, Dec. 2022.
  • 2. Introduction
    • The recognition of high-level events in unconstrained video is an important topic with applications in security (e.g., "making a bomb"), the automotive industry (e.g., "pedestrian crossing the street"), etc.
    • Most approaches are top-down: they "patchify" the frame (context-agnostic) and use the label and loss function to learn to focus on the frame regions related to the event.
    • Bottom-up approaches use an object detector, a feature extractor and a graph network to extract and process features from the main objects in the video.
    • Example video event: "walking the dog".
  • 3. ViGAT
    • Our recent bottom-up approach with SOTA performance on many datasets.
    • Uses a graph attention network (GAT) head to process local (object) and global (frame) information.
    • Also provides frame-/object-level explanations (in contrast to top-down approaches).
    • Example: the video event "removing ice from car" is miscategorized as "shoveling snow"; the object-level explanation shows that the classifier does not focus on the car object.
  • 4. ViGAT block
    • Cornerstone of the ViGAT head; transforms a feature matrix X (K x F), representing the graph's nodes, into a feature vector η (1 x F), representing the whole graph.
    • Pipeline: the attention mechanism computes the attention matrix from the node features and the adjacency matrix A (K x K) from the attention coefficients; the node features are multiplied with the adjacency matrix, giving Z (K x F); graph pooling produces the vector representation η of the graph.
    • Computes the explanation significance (weighted in-degree, WiD) of each node using the graph's adjacency matrix: φ_l = Σ_{k=1}^{K} a_{k,l}, l = 1, ..., K (the WiD is the explanation significance of the l-th node).
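The shape flow of the ViGAT block can be sketched in a few lines of NumPy. The dot-product attention form and the mean graph pooling below are simplifying assumptions for illustration (the actual block uses a learned GAT attention mechanism); what the sketch preserves is the pipeline X (K x F) -> A (K x K) -> Z (K x F) -> η (1 x F) and the WiDs as column sums of A.

```python
import numpy as np

def vigat_block(X, W):
    """Sketch of a ViGAT block: node features X (K x F) -> graph vector eta
    (1 x F) plus per-node weighted in-degrees (WiDs) used for explanation."""
    # Attention scores between every pair of nodes (hypothetical dot-product
    # form; the real block uses learned attention coefficients).
    scores = (X @ W) @ X.T                              # (K x K)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                   # row-softmax adjacency
    Z = A @ X                                           # propagate features
    eta = Z.mean(axis=0, keepdims=True)                 # graph pooling (1 x F)
    wids = A.sum(axis=0)                                # phi_l = sum_k a_{k,l}
    return eta, wids
```

Because each row of A sums to 1, the K WiDs always sum to K; nodes with WiD above average attract more attention from the rest of the graph.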
  • 5. ViGAT architecture
    • Components: object detector o, feature extractor b, GAT blocks ω1, ω2, ω3, classification head u. Global branch: ω1. Local branch: ω2, ω3.
    • Global branch: b extracts frame-level global features from the P video frames; ω1 derives the video-level global feature.
    • Local branch: o detects K objects per frame; b extracts object-level features; ω2 pools them into frame-level local features; ω3 derives the video-level local feature.
    • The two video-level features are concatenated into the video feature and passed to u. Frame WiDs (global and local information) and object WiDs provide the explanation, i.e., the event-supporting frames and objects (example recognized event: "playing beach volleyball!").
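The two-branch data flow above can be sketched as follows; the GAT blocks ω1-ω3 are replaced by a generic `pool` callable (mean pooling, purely for illustration), so only the branch structure and tensor shapes are faithful to the architecture.

```python
import numpy as np

def mean_pool(X):
    # Stand-in for a GAT block: any (N x F) -> (F,) graph pooling works here.
    return X.mean(axis=0)

def vigat_forward(frame_feats, object_feats, pool=mean_pool):
    """frame_feats: (P x F) global features; object_feats: (P x K x F)."""
    # Global branch (omega1): one graph over the P frame feature vectors.
    global_feat = pool(frame_feats)                             # (F,)
    # Local branch (omega2, omega3): a graph per frame over its K objects,
    # then a graph over the resulting P frame-level local features.
    per_frame = np.stack([pool(obj) for obj in object_feats])   # (P, F)
    local_feat = pool(per_frame)                                # (F,)
    # Concatenate both branches into the video feature fed to the head u.
    return np.concatenate([global_feat, local_feat])            # (2F,)
```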
  • 6. ViGAT
    • ViGAT has a high computational cost due to the local (object) information processing (e.g., P = 120 frames, K = 50 objects per frame, i.e., PK = 6000 objects per video).
    • Efficient video processing has been investigated in the top-down (frame) paradigm: frame selection policies identify the most important frames for classification; gating components stop processing frames when sufficient evidence is achieved.
    • Unexplored topic in the bottom-up paradigm: can such techniques reduce the computational complexity of the local processing pipeline of ViGAT?
  • 7. Gated-ViGAT
    • The local information processing pipeline now operates on a subset of frames: the frame selection policy picks Q(s) of the P extracted frames; ω2 and ω3 compute the video-level local feature ζ(s) (and Z(s)), the frame WiDs (local information) and the object WiDs.
    • The gate g(s) (ON/OFF) decides whether the evidence is sufficient. If the gate is closed, Q(s+1) - Q(s) additional frames are requested.
    • When a gate opens, the computed video-level local and global features are concatenated and passed to the head u. Output: recognized event (e.g., "playing beach volleyball!") and explanation (event-supporting frames and objects, using the frame WiDs computed from the global information as well).
  • 8. Gated-ViGAT: frame selection policy
    • Iterative algorithm to select Q frames. Input: Q, an initial frame index p1, and the P frame-level global feature vectors u_1, ..., u_P with their frame WiDs (global information).
    • 1. Initialize: normalize the feature vectors, γ_p = γ_p / |γ_p|, and min-max normalize the WiD-based scores; select p1 = argmax.
    • 2. Select the remaining Q-1 frames; iterate for i = 2 to Q: compute the dissimilarity α_p = (1/2)(1 - γ_p^T γ_{p_{i-1}}) of each frame to the previously selected one, down-weight the scores, u_p = α_p u_p, and select p_i = argmax_p u_p.
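A minimal sketch of the selection loop, assuming the scores being down-weighted are the frame WiDs from the global branch and that already-selected frames are excluded from the argmax; the variable names (`wids`, `feats`) are illustrative:

```python
import numpy as np

def select_frames(wids, feats, Q):
    """Pick Q frames that are salient (high WiD) and mutually diverse.
    wids: (P,) frame WiDs; feats: (P x F) frame-level global features."""
    P = wids.shape[0]
    # Normalize feature vectors so the dot product below is a cosine.
    gamma = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    scores = wids.astype(float).copy()
    avail = np.ones(P, dtype=bool)
    sel = [int(np.argmax(scores))]            # p1: frame with the largest WiD
    avail[sel[0]] = False
    for _ in range(Q - 1):
        # Dissimilarity of every frame to the previously selected one:
        # alpha_p = (1/2) * (1 - gamma_p^T gamma_{p_{i-1}}).
        alpha = 0.5 * (1.0 - gamma @ gamma[sel[-1]])
        scores = scores * alpha               # down-weight similar frames
        nxt = int(np.argmax(np.where(avail, scores, -np.inf)))
        sel.append(nxt)
        avail[nxt] = False
    return sel
```

For example, with two identical frames and one orthogonal frame, the policy skips the duplicate even if its WiD is higher than that of the orthogonal frame.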
  • 9. Gated-ViGAT: gate training
    • Each gate g(s) has a GAT block-like structure and a binary classification head (open/close); it corresponds to a specified number of frames Q(s), and is trained to output 1 (i.e., open) when the ViGAT loss is low. Design hyperparameters: Q(s), β (sensitivity).
    • Training procedure per gate s: use the frame selection policy to select the Q(s) frames for gate g(s); compute the video-level local feature ζ(s) (and Z(s)) with the local ViGAT branch; compute the ViGAT classification loss l_ce = CE(label, y); derive the pseudolabel o(s) = 1 if l_ce <= β e^{s/2}, 0 otherwise; compute the gate component loss L = (1/S) Σ_{s=1}^{S} l_bce(g(s)(Z(s)), o(s)); perform backpropagation to update the gate weights.
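The pseudolabel and loss computation can be sketched as below, assuming the threshold term reads β·e^{s/2} with s = 1..S, and that the gate heads output probabilities; `gate_probs` is a hypothetical stand-in for the outputs g(s)(Z(s)):

```python
import numpy as np

def gate_pseudolabels(lce, beta):
    """lce: (S,) ViGAT CE loss at each gate; o(s) = 1 ('open', exit early)
    when the loss is below the gate-dependent threshold beta * e^{s/2}."""
    S = lce.shape[0]
    thr = beta * np.exp(np.arange(1, S + 1) / 2.0)
    return (lce <= thr).astype(float)

def gate_loss(gate_probs, pseudolabels, eps=1e-12):
    """L = (1/S) * sum_s BCE(g(s)(Z(s)), o(s))."""
    p = np.clip(gate_probs, eps, 1.0 - eps)
    bce = -(pseudolabels * np.log(p) + (1.0 - pseudolabels) * np.log(1.0 - p))
    return float(bce.mean())
```

The growing threshold makes later gates easier to open: a video that is still hard after many frames should eventually exit rather than consume the full budget.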
  • 10. Experiments
    • ActivityNet v1.3: 200 events/actions, 10K/5K training/testing videos, 5 to 10 min duration; multilabel.
    • MiniKinetics: 200 events/actions, 80K/5K training/testing videos, 10 s duration; single-label.
    • Video representation: 120 / 30 uniformly sampled frames for ActivityNet / MiniKinetics.
    • Pretrained ViGAT components: Faster R-CNN (pretrained/finetuned on ImageNet1K/VG, K = 50 objects), ViT-B/16 backbone (pretrained/finetuned on ImageNet11K/ImageNet1K), 3 GAT blocks (pretrained on the respective dataset, i.e., ActivityNet or MiniKinetics).
    • Gates: S = 6 / 5 gates, with sequence lengths {Q(s)} = {9, 12, 16, 20, 25, 30} / {2, 4, 6, 8, 10}, for ActivityNet / MiniKinetics.
    • Gate training hyperparameters: β = 10^-8, 40 epochs, lr = 10^-4 multiplied by 0.1 at epochs 16 and 35.
    • Evaluation measures: mAP (ActivityNet), top-1 accuracy (MiniKinetics), FLOPs.
    • Gated-ViGAT is compared against the top-scoring methods on the two datasets.
  • 11. Experiments: results
    • Gated-ViGAT outperforms all top-down approaches. It slightly underperforms ViGAT, but with approx. 5.5x (ActivityNet) and 4x (MiniKinetics) FLOPs reduction.
    • As expected, it has a higher computational complexity than many top-down approaches (e.g., see [3], [4]) but can provide explanations.
    • ActivityNet (mAP%): AdaFrame [21] 71.5; ListenToLook [23] 72.3; LiteEval [33] 72.7; SCSampler [25] 72.9; AR-Net [13] 73.8; FrameExit [1] 77.3; AR-Net (EfficientNet) [13] 79.7; MARL (ResNet, Kinetics) [22] 82.9; FrameExit (X3D-S) [1] 87.4; ViGAT [9] 88.1; Gated-ViGAT (proposed) 87.3.
    • MiniKinetics (top-1%): TBN [30] 69.5; BAT [7] 70.6; MARS (3D ResNet) [31] 72.8; Fast-S3D (Inception) [14] 78; ATFR (X3D-S) [18] 78; ATFR (R(2+1)D) [18] 78.2; RMS (SlowOnly) [28] 78.6; ATFR (I3D) [18] 78.8; Ada3D (I3D, Kinetics) [32] 79.2; ATFR (3D ResNet) [18] 79.3; CGNL (Modified ResNet) [17] 79.5; TCPNet (ResNet, Kinetics) [3] 80.7; LgNet (R3D) [3] 80.9; FrameExit (EfficientNet) [1] 75.3; ViGAT [9] 82.1; Gated-ViGAT (proposed) 81.3.
    • FLOPs on the two datasets: ActivityNet: ViGAT 137.4, Gated-ViGAT 24.8; MiniKinetics: ViGAT 34.4, Gated-ViGAT 8.7.
    • Best and second-best performance are denoted with bold and underline in the original tables.
  • 12. Experiments: method insight
    • Computed the number of videos processed and the recognition performance at each gate.
    • Average number of frames for ActivityNet / MiniKinetics: 20 / 7.
    • ActivityNet: gates g(1)-g(6) use 9/12/16/20/25/30 frames, process 793/651/722/502/535/1722 videos, and attain 99.8/94.5/93.8/92.7/86/71.6 mAP%.
    • MiniKinetics: gates g(1)-g(5) use 2/4/6/8/10 frames, process 179/686/1199/458/2477 videos, and attain 84.9/83/81.1/84.9/80.7 top-1%.
    • The recognition rate drops as the gate number increases; this behavior is shown more clearly on ActivityNet (longer videos).
    • Conclusion: "easy" videos exit early, while "difficult" videos remain difficult to recognize even with many frames (a similar conclusion to [1]).
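As a quick sanity check, the reported ~20-frame average for ActivityNet follows from the per-gate counts in the table above:

```python
# Gate exit statistics for ActivityNet (frames per gate, videos exiting there).
frames_per_gate = [9, 12, 16, 20, 25, 30]
videos_per_gate = [793, 651, 722, 502, 535, 1722]

# Weighted average number of frames processed per video.
avg_frames = (sum(f * v for f, v in zip(frames_per_gate, videos_per_gate))
              / sum(videos_per_gate))
# avg_frames is about 20.6, matching the reported ~20 frames.
```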
  • 13. Experiments: examples
    • Bullfighting (top) and Cricket (bottom) test videos of ActivityNet exited at the first gate, i.e., were recognized using only 9 frames out of the 120 required by ViGAT.
    • The frames selected with the proposed policy both explain the recognition result and provide a diverse view of the video, helping to recognize the video with fewer frames.
  • 14. Experiments: examples
    • Gated-ViGAT can also provide object-level explanations (in contrast to top-down methods), e.g.: "Waterskiing" predicted as "Making a sandwich"; "Playing accordion" predicted as "Playing guitarra"; "Breakdancing" (correct prediction).
  • 15. Experiments: ablation study on frame selection policies
    • Comparison (mAP%) on ActivityNet for 10/20/30 frames: Random 83/85.5/86.5; WiD-based 84.9/86.1/86.9; Random on local 85.4/86.6/86.9; WiD-based on local 86.6/87.1/87.5; FrameExit policy 86.2/87.3/87.5; Proposed policy 86.7/87.3/87.6; Gated-ViGAT (proposed) 86.8/87.5/87.7.
    • Policy definitions: Random: Θ frames selected randomly for local/global features. WiD-based: Θ frames selected using global WiDs. Random on local: P frames derive the global feature; Θ frames selected randomly. WiD-based on local: P frames derive the global feature; Θ frames selected using global WiDs. FrameExit policy: Θ frames selected using the policy in [1]. Proposed policy: P frames derive the global feature; Θ frames selected using the proposed policy. Gated-ViGAT: in addition to the above, the gate component selects Θ frames on average.
    • Gated-ViGAT selects diverse frames with high explanation potential; the proposed policy is second best, surpassing FrameExit [1] (the current SOTA).
  • 16. Experiments: ablation study example
    • Top-6 frames of a "bungee jumping" video selected with the WiD-based policy vs. the proposed policy (which operates on the updated WiDs).
  • 17. Conclusions
    • An efficient bottom-up event recognition and explanation approach was presented.
    • It utilizes a new frame selection policy that picks frames which: a) best explain the classifier's decision, and b) provide diverse information about the underlying event.
    • It utilizes a gating mechanism that instructs the model to stop extracting bottom-up (object) information when sufficient evidence of the event has been achieved.
    • Evaluation on 2 datasets showed competitive recognition performance and approx. 5x FLOPs reduction in comparison to the previous SOTA.
    • Future work: investigations into further efficiency improvements, e.g., a faster object detector and feature extractor, frame selection also for the global information pipeline, etc.
  • 18. Thank you for your attention! Questions? Nikolaos Gkalelis, gkalelis@iti.gr; Vasileios Mezaris, bmezaris@iti.gr. Code publicly available at: https://github.com/bmezaris/Gated-ViGAT. This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement 101021866 CRiTERIA.