Audio-Visual Semantic Graph Network for Audio-Visual Event Localization
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2025/06/09
Liang Liu et al.
CVPR 2025
2
Introduction
Figure 1. An example illustrating the AVEL task: a segment is marked as an “AV Event” only when an event is both visible and audible simultaneously; otherwise, it is considered
background.
3
Introduction
Audio-Visual Event Localization (AVEL)
● Human Perception & Machine Emulation: Humans naturally integrate multiple senses, especially vision and
hearing, to understand their surroundings. Audio-visual learning aims to enable intelligent machines to emulate
this capability for perception, reasoning, and decision-making
● What is AVEL?
○ A prominent area within audio-visual learning
○ Goal: Identify events that are both audible and visible simultaneously in unconstrained videos
○ Tasks Involved:
■ Classifying event categories (e.g., "car driving," "dog barking")
■ Accurately determining their temporal boundaries
● Example: A segment is an "AV Event" only if both a visible and an audible event occur simultaneously; otherwise, it is considered background. For instance, a "Train horn" event involves both the visual presence of a train and the audible horn (a toy sketch of this labeling rule follows this list)
● Applications: Intelligent surveillance, human-computer interaction, and multimedia retrieval
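To make the labeling rule concrete, here is a minimal toy sketch (not from the paper): given per-segment visibility and audibility flags, a segment receives the event label only when both hold. The 10-segment layout mirrors the AVE dataset; the flags and helper name are illustrative assumptions.

```python
# Toy illustration of the AVEL labeling rule: a segment counts as an "AV Event"
# only when the event is both visible and audible; otherwise it is background.
BACKGROUND = "background"

def avel_labels(event, visible, audible):
    """Per-segment label: the event class if visible AND audible, else background."""
    return [event if v and a else BACKGROUND for v, a in zip(visible, audible)]

# Hypothetical "Train horn" video split into 10 one-second segments.
visible = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # the train is on screen
audible = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]   # the horn is heard
print(avel_labels("Train horn", visible, audible))
# segments 3-5 are labeled "Train horn"; every other segment is background
```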
4
Introduction
Challenges
● Semantic Gap: The primary challenge is the inherent semantic gap between heterogeneous modalities (audio and
visual). This often leads to audio-visual semantic inconsistency
● Temporal Inconsistencies: Existing methods struggle with capturing cross-temporal dependencies sufficiently
● Limitations of Previous Approaches:
○ Many rely on unimodal-guided attention (e.g., audio-guided, visual-guided) which may overlook the potential
of one modality to guide the other
○ These methods can introduce redundant or event-irrelevant information
○ While multimodal adaptive fusion and semantic consistency modeling have shown effectiveness, they still
face challenges in establishing robust cross-modal semantic consistency and temporal dependencies
5
Proposed Method
Figure 2. An overview of the proposed audio-visual semantic graph network (AVSGN). First, frozen pre-trained encoders are utilized to extract visual, audio and text embeddings, respectively. Then, the cross-modal semantic alignment module aligns this multimodal information in a shared semantic space. Subsequently, three subgraphs are explicitly constructed to capture complex interactions across modalities. Finally, a localization layer predicts segment-level event relevance scores and video-level event categories.
7
Proposed Method
Cross-Modal Semantic Alignment (CMSA) Module
● Align heterogeneous context into a shared semantic space, narrowing the
semantic gap
● Attention-driven: This module is primarily attention-based, unlike contrastive
learning-based methods
● Pseudo Intra-modal Alignment:
○ Project the two global label text embeddings into the shared space with a linear layer
○ Intra-CLIP visual attention (Vattn): attention between the CLIP visual features and the projected label text embedding
○ Intra-CLAP audio attention (Aattn): attention between the CLAP audio features and the projected label text embedding
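Because the slide's equations did not survive extraction, the following is a minimal PyTorch-style sketch of the pseudo intra-modal alignment idea under stated assumptions: the global label text embeddings are projected into the shared space by linear layers, and each projected embedding re-weights the segment features of its own encoder (Vattn, Aattn). The module name, dimensions, and the exact attention form are assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn

class PseudoIntraModalAlignment(nn.Module):
    """Sketch of pseudo intra-modal alignment; d=512 matches the CLIP/CLAP embeddings."""
    def __init__(self, d=512):
        super().__init__()
        self.proj_clip_text = nn.Linear(d, d)  # linear layer: CLIP label text -> shared space
        self.proj_clap_text = nn.Linear(d, d)  # linear layer: CLAP label text -> shared space

    @staticmethod
    def _attend(feats, text):
        # feats: [B, T, d] segment features; text: [B, d] projected global label embedding.
        scores = torch.einsum("btd,bd->bt", feats, text) / feats.size(-1) ** 0.5
        attn = scores.softmax(dim=-1).unsqueeze(-1)   # [B, T, 1] attention over segments
        return feats + attn * feats                   # residual, attention-modulated features

    def forward(self, v_feats, a_feats, clip_text, clap_text):
        t_v = self.proj_clip_text(clip_text)
        t_a = self.proj_clap_text(clap_text)
        v_attn = self._attend(v_feats, t_v)           # intra-CLIP visual attention (Vattn)
        a_attn = self._attend(a_feats, t_a)           # intra-CLAP audio attention (Aattn)
        return v_attn, a_attn
```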
8
Proposed Method
Cross-Modal Semantic Alignment (CMSA) Module
● Inter-modal Alignment:
○ Leverages the attention mechanism to bring the target modality closer to the other two modalities (a sketch follows below)
○ The mappings are FC layers with ReLU activation (nonlinear layers)
Figure 3. An illustration of cross-modal alignment, taking visual alignment as an example.
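As a companion to Figure 3, here is a rough sketch of the inter-modal alignment step for the visual target. Only the FC + ReLU projections come from the slide; the scaled dot-product cross-attention, residual connection, and averaging of the two aligned streams are assumptions.

```python
import torch
import torch.nn as nn

class InterModalAlignment(nn.Module):
    """Sketch of inter-modal alignment, taking visual as the target modality (cf. Figure 3)."""
    def __init__(self, d=512):
        super().__init__()
        # FC layers with ReLU activation (the nonlinear layers noted on the slide).
        self.q = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.k = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.v = nn.Sequential(nn.Linear(d, d), nn.ReLU())

    def cross_attend(self, target, source):
        # target, source: [B, T, d]; pull the target modality toward the source modality.
        q, k, v = self.q(target), self.k(source), self.v(source)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return target + attn @ v                        # residual cross-attended features

    def forward(self, visual, audio, text):
        # text: label embedding broadcast to [B, T, d] so all three modalities share a shape.
        v_from_a = self.cross_attend(visual, audio)     # align visual toward audio
        v_from_t = self.cross_attend(visual, text)      # align visual toward text
        return (v_from_a + v_from_t) / 2                # simple fusion of the two streams (assumed)
```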
9
Proposed Method
Cross-Modal Graph Interaction (CMGI) Module
● Capture dynamic dependencies between multiple modalities and suppress
temporal inconsistencies
● Disentangled Interactions: Disentangles complex interactions into three
complementary subgraphs
○ Audio-Visual (Gav): Nodes are visual/audio features; adjacency
relationships are dynamically adjusted based on semantic similarity
using attention
○ Visual-Text (Gvt)
○ Audio-Text (Gat)
● GCN: Attention weights and node features are fed into a multilayer GCN to propagate information and aggregate node representations. Each GCN layer propagates information between neighboring nodes according to the adjacency matrix
● After GCN layers, modality-specific representations are obtained from each
subgraph (e.g., Aav, Vav from Gav; Aat, Tat from Gat; Vvt, Tvt from Gvt)
● Final representations: A* (audio), V* (visual), and T* (text)
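Since the slide only names the subgraphs, the snippet below is a minimal sketch of one of them (the audio-visual subgraph Gav) under stated assumptions: segment-level audio and visual features become graph nodes, the adjacency is an attention-style softmax of pairwise similarity, and a small multilayer GCN propagates information before the modality-specific outputs (Aav, Vav) are split back out.

```python
import torch
import torch.nn as nn

class AVSubgraphGCN(nn.Module):
    """Sketch of the audio-visual subgraph (Gav): attention adjacency + multilayer GCN."""
    def __init__(self, d=512, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(d, d) for _ in range(num_layers))

    def forward(self, audio, visual):
        # audio, visual: [B, T, d] segment features -> 2T graph nodes per video.
        x = torch.cat([audio, visual], dim=1)               # [B, 2T, d] node features
        sim = x @ x.transpose(1, 2) / x.size(-1) ** 0.5     # pairwise semantic similarity
        adj = torch.softmax(sim, dim=-1)                    # dynamically adjusted adjacency
        for layer in self.layers:                           # each layer aggregates from neighbours
            x = torch.relu(layer(adj @ x)) + x              # propagate, transform, residual update
        t = audio.size(1)
        a_av, v_av = x[:, :t], x[:, t:]                     # modality-specific outputs (Aav, Vav)
        return a_av, v_av
```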
10
Proposed Method
Classification
● Gated Mechanism: A gated mechanism integrates complementary information from the
discriminative audio (A*) and visual (V*) embeddings
● Fully-supervised AVEL Task:
○ Video-level category prediction (sc) and segment-level relevance prediction (st)
○ Losses: event category loss, event-relevant loss, segment loss, and gate loss
● Weakly-supervised AVE Task: only video-level event category labels are available for training, so segment-level relevance is not directly supervised
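The gated fusion and the two prediction heads can be summarized as the following minimal sketch; the sigmoid gate, mean pooling, and linear heads are assumed forms, not the paper's exact layers, and the individual loss terms listed above are omitted.

```python
import torch
import torch.nn as nn

class GatedLocalizationHead(nn.Module):
    """Sketch of gated audio-visual fusion plus the AVEL prediction heads."""
    def __init__(self, d=512, num_classes=28):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())
        self.relevance_head = nn.Linear(d, 1)            # segment-level event relevance (s_t)
        self.category_head = nn.Linear(d, num_classes)   # video-level event category (s_c)

    def forward(self, a_star, v_star):
        # a_star, v_star: [B, T, d] final audio/visual representations (A*, V*).
        g = self.gate(torch.cat([a_star, v_star], dim=-1))   # [B, T, d] gate values in (0, 1)
        fused = g * a_star + (1 - g) * v_star                # complementary gated fusion
        s_t = self.relevance_head(fused).squeeze(-1)         # [B, T] segment relevance logits
        s_c = self.category_head(fused.mean(dim=1))          # [B, C] video-level category logits
        return s_t, s_c
```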
11
Experiments
● Dataset
○ AVE (derived from AudioSet)
■ 4,143 videos across 28 categories (e.g., dog barking, helicopter, acoustic guitar)
■ Each video is divided into 10 one-second segments
■ Provides both segment-level AVE labels and video-level event category labels
● Implementation Details:
○ Encoders: Frozen Swin-V2-L-based CLIP for visual features and HTS-AT-based CLAP for audio
features, both providing 512D embeddings. Text embeddings from CLIP and CLAP
○ Optimizer: Adam with batch size 64
○ Hyperparameters: τ (threshold) defaults to 0.6, λ (hyperparameter for semantic consistency loss)
defaults to 0.01
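Putting the listed choices together, a minimal training-setup sketch might look like the following; the stand-in modules are placeholders, while the frozen encoders, the Adam optimizer, batch size 64, τ = 0.6 and λ = 0.01 are the values reported on the slide.

```python
import torch
import torch.nn as nn

# Stand-ins for the frozen feature extractors (Swin-V2-L CLIP / HTS-AT CLAP): no gradients, eval mode.
frozen_visual_encoder = nn.Identity().eval().requires_grad_(False)
frozen_audio_encoder = nn.Identity().eval().requires_grad_(False)

# Placeholder for the trainable AVSGN modules described on the previous slides.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 28))

optimizer = torch.optim.Adam(model.parameters())  # Adam optimizer, as stated on the slide
BATCH_SIZE = 64    # training batch size
TAU = 0.6          # threshold tau
LAMBDA = 0.01      # weight lambda for the semantic consistency loss
```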
12
Experiments
Table 1. Performance (%) comparison with SOTA methods in the fully- and weakly-supervised settings on the AVE dataset.
13
Experiments
Table 2. Ablation studies on the effectiveness of pseudo intra-modal and inter-modal alignment in the CMSA module.
Table 3. Ablation studies on the effectiveness of the three subgraph structures in the CMGI module.
Table 4. Ablation studies on the effectiveness of the loss functions in the fully-supervised setting.
14
Experiments
Figure 4. Qualitative results of our model on the motorcycle event (top) and the dog bark event (bottom).
15
Experiments
Figure 5. Illustration of feature distributions using t-SNE [31] in fully-supervised learning. The first row shows the distribution of audio
features after different components, while the second row shows the distribution of visual features.
16
Conclusion
● Introduces a novel Audio-Visual Semantic Graph Network (AVSGN) to address audio-visual inconsistency in
AVEL
● Cross-Modal Semantic Alignment (CMSA) module: Bridges the semantic gap by introducing shared semantic
labels, promoting multimodal representation convergence into a shared semantic space
● Cross-Modal Graph Interaction (CMGI) module: Disentangles complex interactions into three complementary
subgraphs (audio-text, audio-visual, visual-text) to effectively capture cross-temporal semantic interactions
and suppress temporal inconsistencies
● Future Directions: The framework can be extended to other related multimodal tasks, such as audio-visual video parsing (AVVP), where the advantages of a graph-based model could more effectively handle multi-modal, multi-instance tasks
