
ObjectGraphs


N. Gkalelis, A. Goulas, D. Galanopoulos, V. Mezaris, "ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-up Recognition and Explanation of Events in Video", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2nd Int. Workshop on Large Scale Holistic Video Understanding (HVU), pp. 3375-3383, June 2021. Software available at https://github.com/bmezaris/ObjectGraphs

In this paper a novel bottom-up video event recognition approach is proposed, ObjectGraphs, which utilizes a rich frame representation and the relations between objects within each frame. Following the application of an object detector (OD) on the frames, graphs are used to model the object relations and a graph convolutional network (GCN) is utilized to perform reasoning on the graphs. The resulting object-based frame-level features are then forwarded to a long short-term memory (LSTM) network for video event recognition. Moreover, the weighted in-degrees (WiDs) derived from the graph’s adjacency matrix at frame level are used for identifying the objects that were considered most (or least) salient for event recognition and contributed the most (or least) to the final event recognition decision, thus providing an explanation for the latter. The experimental results show that the proposed method achieves state-of-the-art performance on the publicly available FCVID and YLI-MED datasets.



Slide 1: ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-up Recognition and Explanation of Events in Video
N. Gkalelis, A. Goulas, D. Galanopoulos, V. Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2nd Int. Workshop on Large Scale Holistic Video Understanding, June 2021
Slide 2: Introduction
• The recognition of high-level events in unconstrained video is a major research topic in multimedia understanding
• Example: "Landing a fish" (TRECVID Multimedia Event Detection dataset), with detected objects such as Fish, Fishing pole, Hand
• Most approaches are top-down: they use the event label to implicitly focus on the frame regions most related to the event
• Bottom-up approaches exploit the discriminant information of semantic objects and have shown promising performance, e.g., in visual question answering
Slide 3: ObjectGraphs
• Assume an annotated training set of N videos and C classes
• Keyframe sampling: each video is represented by Q frames
• OD+CNN: an object detector derives the K objects depicted in each frame (those with the highest degree of confidence, DoC)
• Each object is described by its object class label, DoC, bounding box (BB), and a feature vector x_k ∈ R^F
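The per-frame object selection above can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: it assumes a detector has already produced confidence scores and a feature vector per detection, and simply keeps the K most confident objects.

```python
import numpy as np

def top_k_objects(scores, feats, k):
    """Keep the K detections with the highest degree of confidence (DoC).

    scores : (n,) detector confidence per detected object
    feats  : (n, F) one feature vector per detected object
    Returns X : (k, F) feature matrix for the frame's K retained objects.
    """
    order = np.argsort(scores)[::-1][:k]   # indices of the K most confident detections
    return feats[order]
```

The resulting (K, F) matrix is the frame's object representation that the graph construction on the next slide operates on.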
Slide 4: ObjectGraphs
• Construct a similarity matrix S ∈ R^{K×K}, where the element in the l-th row and k-th column is computed as (Wang & Gupta, ECCV 2018):
  S[l,k] = v_l^T v_k, with v_l = W x_l + b and v_k = W x_k + b
• W ∈ R^{F×F} and b ∈ R^F are learnable parameters
• Obtain the adjacency matrix A ∈ R^{K×K} from S so that (Yang et al., CVPR 2020):
  a) [A]_{l,k} ∈ [0,1]
  b) Σ_k [A]_{l,k} = 1 (all edge values from the l-th object are normalized to sum to one)
  A[l,k] = S[l,k]^2 / Σ_{k=1..K} S[l,k]^2
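The graph construction above can be written out in a few lines of numpy. This is a sketch for one frame, assuming W and b are given (in the actual model they are learned end-to-end):

```python
import numpy as np

def frame_graph(X, W, b):
    """Build the frame-level adjacency matrix from object features.

    X : (K, F) object feature vectors of one frame
    W : (F, F) and b : (F,) projection parameters (learnable in the model)
    Returns A : (K, K) with entries in [0, 1] and rows summing to one.
    """
    V = X @ W.T + b                          # v_k = W x_k + b
    S = V @ V.T                              # S[l, k] = v_l^T v_k
    S2 = S ** 2                              # squared similarities
    A = S2 / S2.sum(axis=1, keepdims=True)   # row-normalize: edges from object l sum to 1
    return A
```

The squaring makes all entries non-negative before normalization, which is what guarantees properties (a) and (b) above.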
Slide 5: ObjectGraphs
• An M-layer GCN exploits the frame-level object information:
  X[m] = ReLU(LN(A X[m-1] W[m])), m = 1, ..., M, with X[0] = [x_1, ..., x_K]^T
• An AVGPOOL layer derives a local feature vector z' at frame level
• A CNN applied to the entire frame derives a global feature vector z''
• A CONCAT layer derives z as the frame-level feature vector representation
• An LSTM processes the sequence of frame-level feature vectors:
  h_j = LSTM(z_j, h_{j-1}), j = 1, ..., Q
• The hidden state vector h_Q at the last time step is used as the video-level representation
• A stack of FC layers provides a score for each event
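The GCN recursion and frame-level pooling above can be sketched in numpy. This is a simplified illustration (the layer norm here has no learnable gain/bias, and the frame CNN and LSTM are only indicated in comments), not the authors' trained model:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # per-row layer normalization (no learnable gain/bias in this sketch)
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def gcn_frame_features(A, X0, weights):
    """Run the M-layer GCN on one frame and average-pool the result.

    A       : (K, K) frame adjacency matrix
    X0      : (K, F) input object features, X[0] = [x_1, ..., x_K]^T
    weights : list of M matrices W[m], each (F, F)
    Returns z_local : (F,) the frame-level local feature vector z'
    """
    X = X0
    for Wm in weights:                        # X[m] = ReLU(LN(A X[m-1] W[m]))
        X = np.maximum(layer_norm(A @ X @ Wm), 0.0)
    return X.mean(axis=0)                     # AVGPOOL over the K objects

# The frame representation z then concatenates z_local with the global CNN
# feature z'' of the whole frame; the sequence z_1, ..., z_Q feeds the LSTM,
# whose last hidden state h_Q is the video-level representation.
```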
Slide 6: Explanation of event recognition results
• The network parameters are learned via the cross-entropy (CE) loss, with the event labels as target labels
• The parameters behind the GCN's adjacency matrix implicitly learn to amplify the contribution of the objects most relevant to the event
• Question: how can the adjacency matrix be used to derive the objects that contributed most to the network's decision?
Slide 7: Explanation of event recognition results
• Resort to the weighted in-degree (WiD) of a vertex (used in other domains, e.g., to assess the popularity of a person in social media)
• The WiD of vertex k (corresponding to object k) in adjacency matrix A_j (corresponding to frame j) is computed as:
  γ_k^j = Σ_{l=1..K} [A_j]_{l,k}, k = 1, ..., K
• The OD may detect several instances of the same object class in a frame/video
• Average WiD: computed for each object class p at frame and video level
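The WiD computation is a column sum of the frame's adjacency matrix; a minimal numpy sketch, with a hypothetical helper for the per-class averaging described above:

```python
import numpy as np

def wids(A):
    """Weighted in-degrees of one frame's adjacency matrix.

    A : (K, K) row-normalized adjacency matrix of frame j
    Returns gamma : (K,) where gamma[k] = sum_l A[l, k]
    """
    return A.sum(axis=0)   # column sums: total edge weight flowing into each object

def class_average_wids(A, labels):
    """Average WiD per object class in one frame (hypothetical helper,
    averaging over multiple detected instances of the same class)."""
    gamma = wids(A)
    return {c: gamma[np.array(labels) == c].mean() for c in set(labels)}
```

Since each row of A sums to one, the WiDs of a frame always sum to K; objects with WiDs well above 1 received disproportionate attention from the other objects.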
Slide 8: Experiments
• YLI-MED: TRECVID-style video dataset; 10 event classes, 1000 training and 823 testing videos
• FCVID: multilabel YouTube video dataset; 239 classes (mostly real-world events), 45611 training and 45612 testing videos
• ObjectGraphs is compared against the top-scoring methods in the literature
Slide 9: Experimental results
• Evaluation results on FCVID and YLI-MED:

FCVID (mAP %):
ST-VLAD                         77.5
PivotCorrNN                     77.6
LiteEval                        80
AdaFrame                        80.2
SCSampler                       81
AR-Net (ResNet backbone)        81.3
AR-Net (EfficientNet backbone)  84.4
ObjectGraphs (ResNet backbone)  84.6

YLI-MED (ACC %):
C3D+LSVM      65.61
3D-CNN        72.66
TSN           74.12
ActionVLAD    76.67
S2L           79.46
ObjectGraphs  83.60

• ObjectGraphs improves the state-of-the-art performance by 0.2% (FCVID) and 4.14% (YLI-MED)
• Comparison with the equivalent AR-Net variant (ResNet backbone): +3.3% gain
Slide 10: Explanation results
• Correctly recognized "Wedding ceremony" (BBs of the most/least significant objects based on WiDs)
• Objects with high DoCs (right bar plot): give a general overview of the scene, but are unrelated to the recognized event
• Objects with high WiDs (middle bar plot): indicate the frame regions where the network focuses to recognize the event
Slide 11: Explanation results
• "Working on a woodworking project" mis-recognized as "Person attempting a board trick"
• Objects with the highest (video-level) WiDs: "Skate park" and "Skatepark"; the respective regions influence the network's decision the most
• Note: the wooden construction's roof highly resembles a skate park (and was detected as such by the OD)
Slide 12: Thank you for your attention! Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at: https://github.com/bmezaris/ObjectGraphs
This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreements 832921 MIRROR and 951911 AI4Media
