WATCH, LISTEN AND TELL:
MULTI-MODAL WEAKLY
SUPERVISED DENSE EVENT
CAPTIONING
Seminar Hot Topics in Computer Vision
By Safaa Alnabulsi
22.12.20
Safaa Alnabulsi,TU Berlin
AGENDA
▪ The goal
▪ The dataset
▪ The baseline model
▪ The model in the paper
▪ The used algorithm
▪ Different feature representations
▪ Different fusion strategies
▪ Different evaluation schemes
▪ The results
▪ The limitations
VIDEO UNDERSTANDING
Applications
Content-based recommendation and retrieval
Autonomous driving
Surveillance
Software for visually-impaired people
Approaches
Action recognition
Content summarization
Action anticipation
Video question answering
Video captioning
GENERAL GOAL
▪ Both detecting and describing events in a video
MAIN GOAL OF THIS PAPER
▪ Demonstrate that audio signals can carry a surprising amount of information for
high-level visual-lingual tasks
▪ Show that the audio signal alone can achieve impressive performance on the dense
event captioning task
THE DATASET
▪ 20k videos
▪ Avg. 3.56 temporally localized sentences per video
▪ Avg. 13.48 words per sentence
ENTRY KEY IN ACTIVITYNET
CAPTIONS DATASET
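As a rough illustration of what one entry looks like and how the per-video statistics above are computed, here is a minimal sketch. It assumes the layout of the public annotation files (video id mapped to duration, timestamps and sentences); the video ids, timings and sentences are made up for the example and are not taken from the dataset.

```python
# Hypothetical entries in the style of the ActivityNet Captions annotation files.
# Each key is a video id; each timestamp pair [start, end] (in seconds)
# is aligned with the sentence at the same index.
annotations = {
    "v_example0001": {
        "duration": 30.0,
        "timestamps": [[0.0, 10.0], [8.0, 30.0]],
        "sentences": [
            "A man walks onto a stage.",
            "He plays the guitar while the crowd claps.",
        ],
    },
    "v_example0002": {
        "duration": 12.0,
        "timestamps": [[0.0, 12.0]],
        "sentences": ["A dog runs across the yard."],
    },
}


def dataset_stats(annotations):
    """Return (average sentences per video, average words per sentence)."""
    sentences = [s for entry in annotations.values() for s in entry["sentences"]]
    avg_sentences = len(sentences) / len(annotations)
    avg_words = sum(len(s.split()) for s in sentences) / len(sentences)
    return avg_sentences, avg_words


avg_sentences, avg_words = dataset_stats(annotations)
print(avg_sentences, round(avg_words, 2))
```

Note that events may overlap in time, as the first two timestamps in the toy entry do.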
THE ACTIVITYNET
CHALLENGE
THE ACTIVITYNET
CHALLENGE
THE BASELINE MODEL
▪ The problem was decomposed into a
pair of dual problems:
▪ event captioning
▪ sentence localization
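One way to read this duality is as a cycle: captioning a segment and then localizing the resulting sentence should lead back to the same segment. The toy sketch below illustrates only that idea. The functions `caption_event`, `localize_sentence` and `cycle_error` are hypothetical stand-ins (simple lookups over a hand-written toy video), not the learned modules of the baseline.

```python
# Toy "video": ground-truth events mapped to their descriptions (made up).
TOY_VIDEO = {
    (0.0, 10.0): "a man tunes a guitar",
    (10.0, 30.0): "he plays a song",
}


def overlap(a, b):
    """Length in seconds of the intersection of two (start, end) segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def caption_event(video, segment):
    """Stand-in captioner: describe the event that overlaps the segment most."""
    best = max(video, key=lambda event: overlap(event, segment))
    return video[best]


def localize_sentence(video, sentence):
    """Stand-in localizer: return the segment whose description matches."""
    for event, text in video.items():
        if text == sentence:
            return event
    raise ValueError("sentence does not describe any event")


def cycle_error(video, segment):
    """Boundary distance between a segment and localize(caption(segment))."""
    start, end = localize_sentence(video, caption_event(video, segment))
    return abs(start - segment[0]) + abs(end - segment[1])


print(cycle_error(TOY_VIDEO, (10.0, 30.0)))  # 0.0: the segment is self-consistent
print(cycle_error(TOY_VIDEO, (12.0, 25.0)))  # 7.0: boundaries are 2 s and 5 s off
```

A segment with zero cycle error is a consistent event proposal; in a weakly supervised setting such a consistency signal can stand in for the missing segment annotations.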
THE BASELINE MODEL
THE MODEL IN THIS PAPER
THE USED ALGORITHM
▪ Weakly Supervised
DIFFERENT FEATURE
REPRESENTATIONS
▪ Audio Feature Processing
  ▪ MFCC Features
  ▪ CQT Features
  ▪ SoundNet Features
▪ Video Feature Processing
  ▪ A 3D-CNN model is used to process the
  input video frames into a sequence of
  visual features
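To give a feel for the simplest of the audio representations listed above, here is a minimal NumPy sketch of MFCC extraction (framing, power spectrum, mel filterbank, log, DCT). All parameter values (16 kHz audio, 25 ms frames, 10 ms hop, 26 mel filters, 13 coefficients) are common defaults assumed for illustration, not the paper's settings. CQT and SoundNet features are not shown.

```python
import numpy as np


def mfcc_sketch(signal, sample_rate=16000, frame_len=400, hop=160,
                n_fft=512, n_mels=26, n_mfcc=13):
    """Return an array of shape (n_frames, n_mfcc) for a 1-D waveform."""
    # 1. Slice the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Power spectrum of every frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft

    # 3. Triangular filters spaced evenly on the mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)

    # 4. Log mel energies, then a DCT-II to decorrelate them.
    log_mel = np.log(power @ fbank.T + 1e-10)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * np.arange(n_mels) + 1)
                   / (2 * n_mels))
    return log_mel @ basis.T


# One second of a 440 Hz tone as a stand-in for a real audio track.
t = np.arange(16000) / 16000.0
features = mfcc_sketch(np.sin(2 * np.pi * 440.0 * t))
print(features.shape)  # (98, 13)
```

The result is a sequence of per-frame feature vectors, which is the same kind of object the 3D-CNN produces for the video stream.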
DIFFERENT FUSION
STRATEGIES
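As a rough illustration of what fusing an audio and a video feature vector can look like, the sketch below contrasts two simple generic baselines (concatenation and element-wise product) with a MUTAN-style fusion, where a small core tensor models the bilinear interaction between the projected modalities. The baselines, all dimensions and the random untrained weights are assumptions for illustration, and the nonlinearities and dropout of a full model are omitted.

```python
import numpy as np


def concat_fusion(audio, video):
    """Stack both feature vectors; interactions are left to later layers."""
    return np.concatenate([audio, video])


def multiplicative_fusion(audio, video):
    """Element-wise product; both vectors must have the same size."""
    return audio * video


def tucker_fusion(audio, video, W_a, W_v, core, W_o):
    """MUTAN-style fusion: project each modality, combine the projections
    through a 3-way core tensor, then project to the output space."""
    a = audio @ W_a                           # shape (t_a,)
    v = video @ W_v                           # shape (t_v,)
    z = np.einsum("i,j,ijk->k", a, v, core)   # shape (t_o,)
    return z @ W_o                            # shape (d_out,)


rng = np.random.default_rng(0)
audio = rng.normal(size=8)          # hypothetical audio feature
video = rng.normal(size=8)          # hypothetical video feature
W_a = rng.normal(size=(8, 4))
W_v = rng.normal(size=(8, 4))
core = rng.normal(size=(4, 4, 5))
W_o = rng.normal(size=(5, 6))

fused = tucker_fusion(audio, video, W_a, W_v, core, W_o)
print(concat_fusion(audio, video).shape)          # (16,)
print(multiplicative_fusion(audio, video).shape)  # (8,)
print(fused.shape)                                # (6,)
```

The core tensor has only 4 x 4 x 5 entries here, which is the point of the factorization: a full bilinear map from two 8-dimensional inputs to 6 outputs would need 8 x 8 x 6 parameters.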
DIFFERENT EVALUATION
SCHEMES
▪ METEOR (M)
▪ CIDEr (C)
▪ ROUGE-L
▪ SPICE (S)
▪ BLEU@N (B@N)
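To make one of these metrics concrete, here is a minimal sketch of sentence-level BLEU@N against a single reference: clipped n-gram precisions up to order N, combined by a geometric mean and scaled by a brevity penalty. It is a simplification for illustration (no smoothing, one reference, whitespace tokenization) and not the official evaluation code. The example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU@max_n of one candidate against one reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "a man plays the guitar"
print(bleu("a man plays the guitar", reference, max_n=4))        # 1.0
print(round(bleu("a man plays guitar", reference, max_n=1), 4))  # 0.7788
```

In the second call every word of the candidate appears in the reference, so the unigram precision is 1 and the score below 1 comes entirely from the brevity penalty.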
THE RESULTS
▪ The paper finds that MUTAN fusion is the most appropriate one for its weakly supervised
multi-modal dense event captioning task.
▪ The multi-modal approach (both MFCC and SoundNet audio with C3D video
features) outperforms the state-of-the-art unimodal method.
▪ The multi-modal approaches outperform unimodal ones on both caption quality
and temporal segment accuracy.
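Temporal segment accuracy is commonly measured with the temporal intersection over union (tIoU) between a predicted and a ground-truth segment. The sketch below shows that computation. Treating tIoU as the measure behind the claim above is an assumption, and the segments are made up.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


print(temporal_iou((0.0, 10.0), (0.0, 10.0)))  # 1.0: perfect match
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))  # 5 s shared out of 15 s
print(temporal_iou((0.0, 5.0), (5.0, 10.0)))   # 0.0: segments only touch
```

A prediction whose start is late loses tIoU even when its caption is correct.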
EXAMPLE
THE LIMITATIONS
▪ Sometimes the multi-modal model cannot detect the beginning of an event
correctly.
▪ Most of the time the final model generates only around 2 event captions per video,
which means that the multi-modal approach is still not good enough to detect all the
events in the weakly supervised setting.
THANK YOU FOR YOUR ATTENTION
QUESTIONS?