WATCH, LISTEN AND TELL: MULTI-MODAL WEAKLY SUPERVISED DENSE EVENT CAPTIONING
1. WATCH, LISTEN AND TELL: MULTI-MODAL WEAKLY SUPERVISED DENSE EVENT CAPTIONING
Seminar Hot Topics in Computer Vision
By Safaa Alnabulsi
22.12.20
Safaa Alnabulsi, TU Berlin
2. AGENDA
▪ The goal
▪ The dataset
▪ The baseline model
▪ The model in the paper
▪ The algorithm used
▪ Different feature representations
▪ Different fusion strategies
▪ Different evaluation schemas
▪ The results
▪ The limitations
3. VIDEO UNDERSTANDING
Applications
▪ Content-based recommendation and retrieval
▪ Autonomous driving
▪ Surveillance
▪ Software for visually impaired people
Approaches
▪ Action recognition
▪ Content summarization
▪ Action anticipation
▪ Video question answering
▪ Video captioning
4. GENERAL GOAL
▪ Both detecting and describing events in a video
5. MAIN GOAL OF THIS PAPER
▪ Prove that audio signals can carry a surprising amount of information when it comes to high-level visual-lingual tasks
▪ Show that the audio signal alone can achieve impressive performance on the dense event captioning task
6. THE DATASET
▪ 20k videos
▪ Avg. 3.56 temporally localized sentences per video
▪ Avg. 13.48 words per sentence
7. ENTRY KEY IN ACTIVITYNET CAPTIONS DATASET
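To make the entry key concrete, here is a rough sketch of what one annotation entry in ActivityNet Captions looks like, keyed by video ID. The field names follow the public release; the video ID and all concrete values below are invented for illustration only.

```python
import json

# Sketch of a single ActivityNet Captions annotation entry.
# Field names follow the public release; the ID and values are
# invented placeholders, not real dataset content.
entry = {
    "v_hypothetical_id": {
        "duration": 82.73,                    # video length in seconds
        "timestamps": [[0.28, 33.12],         # [start, end] of each event
                       [30.97, 82.73]],
        "sentences": ["A man sits down at a piano.",
                      "He plays a song while people watch."],
    }
}

video_id, ann = next(iter(entry.items()))
# Each temporally localized sentence pairs with one timestamp.
assert len(ann["timestamps"]) == len(ann["sentences"])
print(json.dumps(ann["sentences"], indent=2))
```

In the weakly supervised setting studied here, only the sentences are available at training time; the timestamps are used solely for evaluating temporal segment accuracy.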
10. THE BASELINE MODEL
▪ The problem was decomposed into a pair of dual problems:
▪ event captioning
▪ sentence localization
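The duality can be sketched as a training cycle: a sentence localizer grounds each caption to a temporal segment, an event captioner re-describes that segment, and consistency between the two captions provides the training signal in place of the missing segment labels. The three functions below are simplistic stand-ins for the learned modules, not the baseline's actual architecture.

```python
# Hypothetical sketch of the weakly supervised dual-problem cycle.
# All three components are placeholder stand-ins for learned modules.

def localizer(video_feats, caption):
    # stand-in: propose the whole video as the grounded segment
    return (0.0, float(len(video_feats)))

def captioner(video_feats, segment):
    # stand-in: emit a fixed caption for any segment
    return "a person does something"

def reconstruction_loss(pred, target):
    # stand-in: Jaccard word-overlap distance instead of a learned loss
    p, t = set(pred.split()), set(target.split())
    return 1.0 - len(p & t) / max(len(p | t), 1)

def train_step(video_feats, caption):
    segment = localizer(video_feats, caption)         # sentence localization
    regenerated = captioner(video_feats, segment)     # event captioning
    return reconstruction_loss(regenerated, caption)  # cycle-consistency

loss = train_step([0.0] * 10, "a person plays piano")
print(f"reconstruction loss: {loss:.3f}")
```

The key point is that no ground-truth segment ever appears in the loss: supervision comes entirely from reconstructing the caption, which is all the weakly supervised setting provides.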
14. DIFFERENT FEATURE REPRESENTATIONS
▪ Audio Feature Processing
▪ MFCC Features
▪ CQT Features
▪ SoundNet Features
▪ Video Feature Processing
▪ A 3D-CNN model is used to process the input video frames into a sequence of visual features
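As a reference for what the MFCC pipeline computes, here is a minimal NumPy-only sketch (frame, window, power spectrum, mel filterbank, log, DCT). All parameter values are illustrative defaults, not the paper's settings; in practice a library such as librosa would be used instead.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=40, n_mfcc=13):
    """Illustrative MFCC computation with NumPy only."""
    # 1) Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # 2) Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3) Triangular mel filterbank.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 4) DCT-II over the mel axis yields the cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5)
                 * np.arange(n_mfcc)[:, None])
    return log_mel @ dct.T          # shape: (n_frames, n_mfcc)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
print(mfcc(sig).shape)              # one 13-dim vector per frame
```

CQT features follow the same framing idea but use logarithmically spaced frequency bins, while SoundNet features come from a pretrained CNN rather than a fixed signal-processing pipeline.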
17. THE RESULTS
▪ We find that MUTAN fusion is the most appropriate for this weakly supervised multi-modal dense event captioning task.
▪ The multi-modal approach (MFCC and SoundNet audio combined with C3D video features) outperforms the state-of-the-art unimodal method.
▪ The multi-modal approaches outperform unimodal ones in both caption quality and temporal segment accuracy.
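For intuition about why MUTAN fusion works well here, the sketch below shows the Tucker-decomposed bilinear interaction it is built on: both modalities are projected, combined through a core tensor, and projected again. All dimensions and weights are random illustrative stand-ins, not the paper's learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutan_fuse(v, a, Wv, Wa, core, Wo):
    """MUTAN-style fusion of a video feature v and an audio feature a
    via a Tucker-decomposed bilinear interaction (illustrative only)."""
    hv = np.tanh(Wv @ v)                      # project video modality
    ha = np.tanh(Wa @ a)                      # project audio modality
    # Bilinear interaction through the core tensor: z_k = hv^T T_k ha
    z = np.einsum("i,kij,j->k", hv, core, ha)
    return Wo @ np.tanh(z)                    # output projection

dv, da, dh, do = 500, 128, 64, 32             # illustrative dimensions
Wv = rng.normal(size=(dh, dv)) * 0.01         # random stand-in weights
Wa = rng.normal(size=(dh, da)) * 0.01
core = rng.normal(size=(dh, dh, dh)) * 0.01
Wo = rng.normal(size=(do, dh)) * 0.01

fused = mutan_fuse(rng.normal(size=dv), rng.normal(size=da),
                   Wv, Wa, core, Wo)
print(fused.shape)                            # prints (32,)
```

The Tucker factorization keeps the full bilinear video-audio interaction while avoiding the parameter blow-up of an unfactorized bilinear layer, which is the usual motivation for choosing MUTAN over simple concatenation.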
19. THE LIMITATIONS
▪ Sometimes the multi-modal model cannot detect the beginning of an event correctly.
▪ Most of the time the final model generates only around two event captions, which means the multi-modal approach is still not good enough to detect all the events in the weakly supervised setting.