TRECVID 2019
Video to Text Description
Asad A. Butt
NIST; Johns Hopkins University
George Awad
NIST; Georgetown University
Yvette Graham
Dublin City University
1
TRECVID 2019
Goals and Motivations
✓ Measure how well an automatic system can describe a video in natural language.
✓ Measure how well an automatic system can match high-level textual descriptions to low-level computer vision features.
✓ Transfer successful image captioning technology to the video domain.
Real world Applications
✓ Video summarization
✓ Supporting search and browsing
✓ Accessibility: video description for the blind
✓ Video event prediction
2
TRECVID 2019
SUBTASKS
• Systems are asked to submit results for two subtasks:
1. Description Generation (Core):
Automatically generate a text description for each video.
2. Matching & Ranking (Optional):
Return for each video a ranked list of the most likely text description
from each of the five sets.
3
TRECVID 2019
Video Dataset
The VTT data for 2019 consisted of two video sources:
• Twitter Vine:
• Crawled 50k+ Twitter Vine video URLs.
• Approximate video duration is 6 seconds.
• Selected 1044 Vine videos for this year’s task.
• Used since inception of VTT task.
• Flickr:
• Flickr video was collected under the Creative Commons License.
• A set of 91 videos was collected, which was divided into 74,958
segments.
• Approximate video duration is 10 seconds.
• Selected 1010 segments.
4
TRECVID 2019
Dataset Cleaning
• Before selecting the dataset, we clustered videos based on visual similarity.
• This removed duplicate videos, as well as those that were very visually similar (e.g., soccer games), yielding a more diverse set of videos.
• Then, we manually went through the large collection of videos.
• Used a list of commonly appearing topics to filter videos.
• Removed videos with multiple, unrelated segments that are hard to describe.
• Removed any animated (or otherwise unsuitable) videos.
5
TRECVID 2019
Annotation Process
• A total of 10 assessors annotated the videos.
• Each video was annotated by 5 assessors.
• Annotation guidelines by NIST:
• For each video, annotators were asked to combine 4 facets if
applicable:
• Who is the video showing (objects, persons, animals, etc.)?
• What are the objects and beings doing (actions, states, events, etc.)?
• Where (locale, site, place, geographic, etc.)?
• When (time of day, season, etc.)?
6
TRECVID 2019
Annotation – Observations
• Questions asked:
• Q1 Avg Score: 2.03 (scale of 5)
• Q2 Avg Score: 2.51 (scale of 3)
• Correlation between difficulty
scores: -0.72
• Average Sentence Length for
each assessor:
TRECVID 2019
7
Assessor # Avg. Length
1 17.72
2 19.55
3 18.76
4 22.07
5 20.42
6 12.83
7 16.07
8 21.73
9 16.49
10 21.16
2019 Participants (10 teams finished)
Team | Matching & Ranking (11 Runs) | Description Generation (30 Runs)
IMFD_IMPRESEE | ✓ | ✓
KSLAB | ✓ | ✓
RUCMM | ✓ | ✓
RUC_AIM3 | ✓ | ✓
EURECOM_MeMAD | | ✓
FDU | | ✓
INSIGHT_DCU | | ✓
KU_ISPL | | ✓
PICSOM | | ✓
UTS_ISA | | ✓
8
TRECVID 2019
Run Types
• Each run was classified by the following run
type:
• 'I': Only image captioning datasets were
used for training.
• 'V': Only video captioning datasets were
used for training.
• 'B': Both image and video
captioning datasets were used for training.
9
TRECVID 2019
Run Types
• All runs in Matching and Ranking are of type
‘V’.
• For Description Generation the distribution is:
• Run type ‘I’: 1 run
• Run type ‘B’: 3 runs
• Run type ‘V’: 26 runs
10
TRECVID 2019
Subtask 1: Description Generation
11
“a dog is licking its nose”
Given a video
Generate a textual description
• Up to 4 runs in the Description Generation subtask.
• Metrics used for evaluation (a minimal scoring sketch follows this slide):
• CIDEr (Consensus-based Image Description Evaluation)
• METEOR (Metric for Evaluation of Translation with Explicit
Ordering)
• BLEU (BiLingual Evaluation Understudy)
• STS (Semantic Textual Similarity)
• DA (Direct Assessment), which is a crowdsourced rating of
captions using Amazon Mechanical Turk (AMT)
Who? What? Where? When?
TRECVID 2019
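As a rough illustration of how the sentence-level metrics above are applied (this is not the official NIST scoring code), the sketch below averages NLTK sentence-level BLEU for a run's captions against their five human reference descriptions; the video ids and captions are hypothetical.

```python
# A minimal sketch, assuming NLTK is installed: average sentence-level BLEU of
# system captions against their five human references. Not the official NIST
# scoring code; the data below is made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = {  # video_id -> five human descriptions (hypothetical)
    "v001": [
        "a young man plays a guitar and sings into a microphone",
        "a man plays guitar in front of a white wall inside",
        "young man sits in front of a mike, strums guitar, and sings",
        "a young man in a room plays guitar and sings",
        "a man plays a guitar and sings a song while looking at the camera",
    ],
}
system_captions = {"v001": "a man is playing a guitar and singing"}  # hypothetical run output

smooth = SmoothingFunction().method1  # avoid zero scores for very short captions
scores = []
for vid, caption in system_captions.items():
    refs = [r.lower().split() for r in references[vid]]
    scores.append(sentence_bleu(refs, caption.lower().split(), smoothing_function=smooth))

print("Average BLEU over videos:", sum(scores) / len(scores))
```

CIDEr, METEOR, and STS broadly follow the same per-caption, multi-reference pattern before averaging over the test set.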
Significance Test – CIDEr
TRECVID 2019
17
(Significance matrix rows/columns: RUC_AIM3, UTS_ISA, FDU, RUCMM, PicSOM, EURECOM, KU_ISPL, KsLab, IMFD_IMPRESEE, Insight_DCU)
• Green squares indicate a significant “win” for the row over the column using the CIDEr metric.
• Significance calculated at p<0.001
• RUC_AIM3 outperforms all other systems.
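The slides do not spell out which significance test was used; the sketch below shows one generic option, a paired randomization (sign-flip) test on per-video score differences between two runs, with made-up scores. It illustrates the idea behind the win/loss matrix rather than the workshop's exact procedure.

```python
# A generic paired randomization test between two runs, given per-video metric
# scores (hypothetical numbers); the workshop's exact significance procedure may differ.
import random

def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for the mean per-video difference between two runs."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        # Randomly flip which run each paired score is attributed to.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / trials

run_a = [0.92, 0.71, 0.80, 0.65, 0.95, 0.88]  # hypothetical per-video CIDEr scores
run_b = [0.50, 0.62, 0.55, 0.41, 0.70, 0.48]
p = paired_randomization_test(run_a, run_b)
print("p-value:", p, "significant at p < 0.001:", p < 0.001)
```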
Metric Correlation
TRECVID 2019
18
CIDER CIDER-D METEOR BLEU STS_1 STS_2 STS_3 STS_4 STS_5
CIDER 1.000 0.964 0.923 0.902 0.929 0.900 0.910 0.887 0.900
CIDER-D 0.964 1.000 0.903 0.958 0.848 0.815 0.828 0.800 0.816
METEOR 0.923 0.903 1.000 0.850 0.928 0.916 0.921 0.891 0.904
BLEU 0.902 0.958 0.850 1.000 0.775 0.742 0.752 0.724 0.741
STS_1 0.929 0.848 0.928 0.775 1.000 0.997 0.998 0.990 0.994
STS_2 0.900 0.815 0.916 0.742 0.997 1.000 0.999 0.995 0.997
STS_3 0.910 0.828 0.921 0.752 0.998 0.999 1.000 0.995 0.997
STS_4 0.887 0.800 0.891 0.724 0.990 0.995 0.995 1.000 0.998
STS_5 0.900 0.816 0.904 0.741 0.994 0.997 0.997 0.998 1.000
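For readers who want to reproduce this kind of table, the sketch below computes a correlation matrix across runs with pandas, assuming Pearson correlation of per-run scores (the slide does not state the exact procedure); the per-run numbers are made up.

```python
# A minimal sketch, assuming the table is a Pearson correlation of per-run metric
# scores; the values below are made up, not the 2019 results.
import pandas as pd

per_run_scores = pd.DataFrame({
    "CIDEr":  [0.59, 0.45, 0.40, 0.33, 0.21, 0.15],
    "METEOR": [0.31, 0.27, 0.25, 0.22, 0.17, 0.14],
    "BLEU":   [0.064, 0.050, 0.041, 0.030, 0.015, 0.010],
    "STS":    [0.48, 0.44, 0.42, 0.40, 0.35, 0.31],
})
print(per_run_scores.corr(method="pearson").round(3))
```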
Comparison with 2018
Metric 2018 2019
CIDEr 0.416 0.585
CIDEr-D 0.154 0.332
METEOR 0.231 0.306
BLEU 0.024 0.064
STS 0.433 0.484
TRECVID 2019
19
• Scores have increased across all metrics from last year.
• The table shows the maximum score for each metric from 2018 and 2019.
Direct Assessment (DA)
• DA uses crowdsourcing to evaluate how well a caption describes a
video.
• Human evaluators rate captions on a scale of 0 to 100.
• DA conducted on only primary runs for each team.
• Measures:
• RAW: Average DA score [0..100] for each system (non-standardized), micro-averaged per caption and then averaged overall.
• Z: Average DA score per system after standardizing each rating by the individual AMT worker’s mean and standard deviation (a small sketch follows this slide).
20
TRECVID 2019
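A minimal sketch of the two measures above, under the assumption that each rating is standardized by its own AMT worker's mean and standard deviation before per-system averaging; the systems, workers, captions, and ratings are made up.

```python
# A minimal sketch of the RAW and Z measures; all ratings below are hypothetical.
from collections import defaultdict
from statistics import mean, pstdev

# (system, worker, caption_id, rating in 0..100)
ratings = [
    ("sysA", "w1", "c1", 70), ("sysA", "w2", "c1", 55), ("sysA", "w1", "c2", 80),
    ("sysB", "w1", "c3", 40), ("sysB", "w2", "c3", 60), ("sysB", "w2", "c4", 35),
]

# RAW: micro-average ratings per caption, then average captions per system.
per_caption = defaultdict(list)
for system, _, caption, score in ratings:
    per_caption[(system, caption)].append(score)
raw = defaultdict(list)
for (system, _), scores in per_caption.items():
    raw[system].append(mean(scores))
print("RAW:", {s: round(mean(v), 2) for s, v in raw.items()})

# Z: standardize each rating by its worker's mean/std, then average per system.
per_worker = defaultdict(list)
for _, worker, _, score in ratings:
    per_worker[worker].append(score)
worker_stats = {w: (mean(v), pstdev(v) or 1.0) for w, v in per_worker.items()}
z = defaultdict(list)
for system, worker, _, score in ratings:
    mu, sd = worker_stats[worker]
    z[system].append((score - mu) / sd)
print("Z:", {s: round(mean(v), 2) for s, v in z.items()})
```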
What DA Results Tell Us
• Green squares indicate a significant “win” for
the row over the column.
• No system yet reaches human performance.
• Humans B and E score statistically better than Humans C and D, though this comparison should be interpreted with caution since each ‘Human’ system contains multiple assessors.
• Among systems, RUC_AIM3 and RUCMM outperform the rest, with significant wins.
TRECVID 2019 23
(Significance matrix rows/columns: HUMAN.B, HUMAN.E, HUMAN.D, HUMAN.C, RUC_AIM3, RUCMM, UTS_ISA, FDU, EURECOM_MeMAD, KU_ISPL_prior, PicSOM_MeMAD, KsLab_s2s, IMFD_IMPRESEE_MSVD, Insight_DCU)
Correlation Between Metrics (Primary Runs)
TRECVID 2019
24
CIDER CIDER-D METEOR BLEU STS DA_Z
CIDER 1.000 0.972 0.963 0.902 0.937 0.874
CIDER-D 0.972 1.000 0.967 0.969 0.852 0.832
METEOR 0.963 0.967 1.000 0.936 0.863 0.763
BLEU 0.902 0.969 0.936 1.000 0.750 0.711
STS 0.937 0.852 0.863 0.750 1.000 0.812
DA_Z 0.874 0.832 0.763 0.711 0.812 1.000
Flickr vs Vines
TRECVID 2019
30
Team Flickr Vines
IMFD_IMPRESEE 5.49 5.41
EURECOM 6.16 6.21
RUCMM 7.63 7.93
KU_ISPL 7.72 7.64
PicSOM 8.58 9.09
FDU 9.06 9.44
KsLab 9.50 9.95
Insight_DCU 11.59 12.23
RUC_AIM3 12.62 11.63
UTS_ISA 15.16 15.32
• Table 1 (above) shows the average sentence lengths for different runs over the Flickr and Vines datasets.
• The ground-truth (GT) average sentence lengths are 17.48 for Flickr and 18.85 for Vines.
• There is no significant difference to show that sentence length played any role in score differences.
• It is difficult to reach a conclusion regarding the difficulty/ease of one dataset over the other.
Top 3 Results – Description Generation
31
#1439 #1080
#826
TRECVID 2019
Assessor Captions:
1. White male teenager in a black jacket
playing a guitar and singing into a
microphone in a room
2. Young man sits in front of mike, strums
guitar, and sings.
3. A man plays guitar in front of a white
wall inside.
4. a young man in a room plays guitar
and sings into a microphone
5. A young man plays a guitar and sings a
song while looking at the camera.
Bottom 3 Results – Description Generation
32
#688 #1330
#913
TRECVID 2019
Assessor Captions:
1. Two knitted finger puppets rub against
each other in front of white cloth with
pink and yellow squares
2. two finger's dolls are hugging.
3. Two finger puppet cats, on beige and
white and on black and yellow,
embrace in front of a polka dot
background.
4. two finger puppets hugging each other
5. Two finger puppets embrace in front
of a background that is white with
colored blocks printed on it.
Example of System Captions
1. a man is singing and playing guitar
2. a man is playing a guitar and singing
3. a man is playing a guitar
4. a man is playing a guitar and playing the
guitar in front of a microphone
5. a man is sitting in a chair and playing a
guitar and singing
6. a young man singing into a microphone in
a room in front of a guitar
7. a man is sitting at a desk and talking
8. a man is talking about a video
33
TRECVID 2019
Observations – Description Generation
• This subtask captures the essence of the VTT task as
systems try to describe videos in natural language.
• It was made mandatory for VTT participants for the
first time.
• A number of metrics were used to evaluate results.
• For the first time, multiple video sources were used.
• No obvious advantage or disadvantage for either source, probably because care was taken to get a diverse set of real-world videos.
34
TRECVID 2019
Subtask 2: Matching & Ranking
35
Person reading newspaper outdoors at daytime
Three men running in the street at daytime
Person playing golf outdoors in the field
Two men looking at laptop in an office
• Up to 4 runs per site were allowed in the Matching & Ranking subtask.
• Mean inverted rank used for evaluation (a minimal sketch follows this slide).
• Five sets of descriptions used.
TRECVID 2019
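Taking "mean inverted rank" to be the mean reciprocal rank of the correct description in each video's ranked list, the scoring can be sketched as follows; the rankings and ids are hypothetical.

```python
# A minimal sketch of mean inverted (reciprocal) rank; data is hypothetical.
def mean_inverted_rank(ranked_lists, ground_truth):
    """ranked_lists: video_id -> description ids ordered best-first.
    ground_truth: video_id -> the correct description id."""
    total = 0.0
    for vid, ranking in ranked_lists.items():
        rank = ranking.index(ground_truth[vid]) + 1  # 1-based rank of the correct description
        total += 1.0 / rank
    return total / len(ranked_lists)

ranked = {"v1": ["d3", "d1", "d2"], "v2": ["d2", "d3", "d1"]}
truth = {"v1": "d1", "v2": "d2"}
print(mean_inverted_rank(ranked, truth))  # (1/2 + 1/1) / 2 = 0.75
```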
Top 3 Results
37
#13 #455
#32
TRECVID 2019
Bottom 3 Results
38
#1704 #1822
#205
TRECVID 2019
Observations – Matching and Ranking
• 4 teams participated in this optional task.
• The overall mean inverted rank score increased from the previous year. The table shows the maximum scores for 2018 and 2019.
39
TRECVID 2019
2018 2019
Mean Inverted Rank 0.516 0.727
(Very) High Level Overview
of Approaches
TRECVID 2019
40
RUC_AIM3
• Matching & Ranking:
• Dual encoding module used.
• Given a sequence of input features, three branches encode global, temporal, and local information.
• Encoded features are then concatenated and mapped into a joint embedding space (see the toy sketch after this slide).
TRECVID 2019
41
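As a toy illustration of the general dual-encoding idea (not RUC_AIM3's implementation), the sketch below concatenates three branch encodings, projects them into a joint space, and ranks candidate captions by cosine similarity; every dimension, weight, and caption vector is an arbitrary stand-in.

```python
# A toy sketch of concatenating branch encodings and ranking captions in a joint
# embedding space; all values are random stand-ins, not a trained model.
import numpy as np

rng = np.random.default_rng(0)
global_enc = rng.normal(size=512)    # stand-in for the global branch encoding
temporal_enc = rng.normal(size=256)  # stand-in for the temporal branch encoding
local_enc = rng.normal(size=256)     # stand-in for the local branch encoding

video_vec = np.concatenate([global_enc, temporal_enc, local_enc])  # 1024-d
W_video = rng.normal(size=(300, video_vec.size)) * 0.01            # projection (random here, learned in practice)
video_joint = W_video @ video_vec                                  # video in the joint space

caption_joint = rng.normal(size=(5, 300))  # stand-ins for 5 encoded candidate captions

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(video_joint, c) for c in caption_joint]
print("captions ranked by similarity:", list(np.argsort(scores)[::-1]))
```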
RUC_AIM3
• Description Generation:
• Video Semantic Encoding: video features extracted with temporal and semantic attention.
• Description Generation with temporal and semantic attention.
• Reinforcement Learning Optimization: the captioning model is fine-tuned through RL with fluency and visual relevance rewards.
• A pre-trained language model provides the fluency reward.
• For visual relevance, the matching-and-ranking model is used so that embedding vectors are close in the joint space.
• Ensemble: various captioning modules are used; relevance is then used to rerank captions.
• Datasets Used:
• TGIF, MSR-VTT, VATEX, VTT 2016-17
TRECVID 2019
42
UTS_ISA
• Framework contains three parts:
• Extraction of high level visual and action features.
• Visual features: ResNeXt-WSL, EfficientNet
• Action + temporal features: Kinetics-I3D features
• An LSTM-based (recurrent) encoder-decoder framework handles fusion and learning.
• An expandable ensemble module is used; a controllable beam search strategy generates sentences of different lengths.
• Datasets Used:
• MSVD, MSR-VTT, VTT 2016-18
TRECVID 2019
43
RUCMM
• Matching & Ranking:
• Dual encoding used; a BERT encoder is included to improve the dual encoding.
• Best result obtained by combining models.
• Description Generation:
• Based on the classical encoder-decoder framework.
• The video-side multi-level encoding branch of the dual encoding framework is used instead of common mean pooling.
• Datasets Used:
• MSR-VTT, MSVD, TGIF, VTT-16
TRECVID 2019
44
DCU
• A commonly used BLSTM network: C3D features as input, followed by soft attention, which is fed to a final LSTM.
• A beam search method is used to find the sentences with the highest probability.
• GloVe embeddings for output words.
• Datasets Used:
• TGIF, VTT
TRECVID 2019
45
IMFD_IMPRESEE
• Matching & Ranking:
• Deep learning model based on W2VV++ (developed for AVS).
• Extended by using Dense Trajectories as visual embedding to encode
temporal information of the video.
• K-means clustering to encode Dense Trajectories.
• Sentence and video embedding into a common vector space.
• The run without batch normalization performed better than the run with it.
TRECVID 2019
46
IMFD_IMPRESEE
• Description Generation:
• Semantic Compositional Network (SCN) to effectively understand individual semantic concepts in videos.
• Then a recurrent encoder based on a bidirectional LSTM is used.
• Datasets Used:
• MSR-VTT
TRECVID 2019
47
FDU
• For visual representation, used Inception-Resnet-V2 CNN pretrained
on the ImageNet dataset.
• Concept detection to bridge the gap between the feature representation and the text domain.
• LSTM to generate sentences.
• Datasets Used:
• TGIF, VTT 2017
TRECVID 2019
48
KSLab, Nagaoka University of Technology
• The goal is to decrease processing time.
• The system processes 5 consecutive frames from the beginning and the end of each video (see the frame-selection sketch after this slide).
• Each frame is converted to a 2048-dimensional feature vector with an Inception V3 network. The encoder-decoder network is built from two LSTM networks.
• No connection observed between video length and score.
• Datasets Used:
• TGIF, VTT 2016-17
TRECVID 2019
49
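A minimal OpenCV sketch of the frame-selection step described above; the file name is hypothetical and the Inception V3 encoding is left as a comment, so this is an assumption-laden illustration rather than KSLab's pipeline.

```python
# A minimal sketch: grab 5 consecutive frames from the start and 5 from the end
# of a video with OpenCV; each frame would then be encoded into a 2048-d vector
# by Inception V3 (that step is omitted here). "example.mp4" is a hypothetical file.
import cv2

def select_boundary_frames(video_path, n=5):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = list(range(min(n, total))) + list(range(max(total - n, 0), total))
    frames = []
    for idx in wanted:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # next: Inception V3 -> 2048-d features -> LSTM encoder-decoder

print(len(select_boundary_frames("example.mp4")), "frames selected")
```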
PicSOM and EURECOM
• Combined notebook paper. Tried to answer multiple research
questions.
• PicSOM
• Comparison of cross-entropy and self-critical training loss functions.
• Self-critical training uses CIDEr-D scores as the reward in reinforcement learning.
• As expected, self-critical training works better.
• Use of both still image data and video features improves performance. For
still images, video features were non-informative.
TRECVID 2019
50
PicSOM and EURECOM
• EURECOM
• Experimented with the use of Curriculum Learning in video captioning.
• The idea is to present data in an ascending order of difficulty during training.
• Captions are translated into a list of indices, with larger indices for less frequent words.
• The score of a sample is the maximum index in its caption (see the sketch after this slide).
• Video features extracted with an I3D neural network.
• The process does not seem to be beneficial.
• Datasets Used:
• MS-COCO, MSR-VTT, TGIF, MSVD, VTT 2018
TRECVID 2019
51
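A minimal sketch of the difficulty score described above, under the assumption that words are indexed by descending frequency (rarer words get larger indices) and a caption's score is the largest index it contains; the training captions are made up and this is not EURECOM's code.

```python
# A minimal sketch of the curriculum difficulty score: larger word indices for
# less frequent words, caption score = maximum index. Captions are hypothetical.
from collections import Counter

captions = [
    "a man is playing a guitar",
    "a young man strums an acoustic guitar indoors",
    "two knitted finger puppets embrace",
]
freq = Counter(word for c in captions for word in c.split())
# Index words by descending frequency: the most frequent word gets index 1.
word_index = {w: i + 1 for i, (w, _) in enumerate(freq.most_common())}

def difficulty(caption):
    return max(word_index[w] for w in caption.split())

# Present training samples in ascending order of difficulty (curriculum order).
for c in sorted(captions, key=difficulty):
    print(difficulty(c), c)
```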
Conclusion
• Good level of participation. The task will be renewed.
• This year we used two video sources – Flickr and Vines.
• Each video had 5 annotations.
• Lots of available training sets.
• Multiple research questions explored on the way to solving the VTT task.
• Metric scores for Description Generation and Matching &
Ranking have increased over last year.
• A new dataset is in the works – Details to come.
52
TRECVID 2019
Discussion
• Is there value in the matching and ranking sub-task?
Should it be continued as an optional sub-task? Are any
teams interested in only this particular sub-task?
• Is the inclusion of run types valuable?
• We may add other popular metrics, such as SPICE. Any
suggestions for adding/removing metrics?
• What did individual teams learn?
• Do the participating teams have any suggestions to
improve the task?
53
TRECVID 2019