TRECVID 2018
Video to Text Description
Asad A. Butt, NIST
George Awad, NIST; Dakota Consulting, Inc.
Alan Smeaton, Dublin City University
TRECVID 2018
Disclaimer: Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.
Goals and Motivations
✓ Measure how well an automatic system can describe a video in natural language.
✓ Measure how well an automatic system can match high-level textual descriptions to low-level computer vision features.
✓ Transfer successful image captioning technology to the video domain.
Real-World Applications
✓ Video summarization
✓ Supporting search and browsing
✓ Accessibility: video description for the blind
✓ Video event prediction
TASKS
• Systems are asked to submit results for two subtasks:
1. Matching & Ranking: return, for each URL, a ranked list of the most likely text descriptions from each of the five sets.
2. Description Generation: automatically generate a text description for each URL.
Video Dataset
• Crawled 50k+ Twitter Vine video URLs.
• Maximum video duration: 6 seconds.
• A subset of 2,000 URLs was (quasi-)randomly selected and divided amongst 10 assessors.
• Significant preprocessing was done to remove unsuitable videos.
• The final dataset included 1,903 URLs, after some videos were removed from Vine itself.
Steps to Remove Redundancy
▪ Before selecting the dataset, we clustered videos based on visual similarity.
▪ We used a tool called SOTU [1], which uses a visual bag-of-words approach to cluster videos that share at least 60% similarity over at least 3 frames.
▪ This removed duplicate videos, as well as those that were highly similar visually (e.g. soccer games), resulting in a more diverse set of videos.
[1] Zhao, Wan-Lei and Ngo, Chong-Wah. "SOTU in Action." (2012).
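To make the redundancy-removal step concrete, here is an illustrative sketch loosely in the spirit of the clustering described above. The 60%-similarity and 3-frame thresholds come from the slide; the histogram-intersection features and the greedy clustering are simplifying assumptions for illustration, not SOTU's actual algorithm.

```python
# Toy near-duplicate video filtering: each video is a list of normalized
# per-frame histograms; two videos "match" when enough frame pairs are similar.

def frame_similarity(h1, h2):
    """Histogram intersection between two normalized frame histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def videos_match(video_a, video_b, sim_thresh=0.6, min_frames=3):
    """Videos match if at least `min_frames` frame pairs exceed `sim_thresh`."""
    hits = sum(
        1
        for fa in video_a
        for fb in video_b
        if frame_similarity(fa, fb) >= sim_thresh
    )
    return hits >= min_frames

def deduplicate(videos):
    """Greedily keep one representative per near-duplicate group."""
    kept = []
    for vid in videos:
        if not any(videos_match(vid, k) for k in kept):
            kept.append(vid)
    return kept
```

A real system would use quantized local descriptors (visual words) rather than global histograms, but the filtering logic is the same: drop any video that matches one already kept.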
Dataset Cleaning
▪ Dataset creation process: manually went through a large collection of videos.
▪ Used the list of commonly appearing videos from last year to select a diverse set of videos.
▪ Removed videos with multiple, unrelated segments that are hard to describe.
▪ Removed any animated (or otherwise unsuitable) videos.
▪ Resulted in a much cleaner dataset.
Annotation Process
• Each video was annotated by 5 assessors.
• Annotation guidelines by NIST: for each video, annotators were asked to combine 4 facets, where applicable:
  • Who is the video describing (objects, persons, animals, etc.)?
  • What are the objects and beings doing (actions, states, events, etc.)?
  • Where (locale, site, place, geographic location, etc.)?
  • When (time of day, season, etc.)?
Annotation Process – Observations
1. Different assessors provide varying amounts of detail when describing videos. Some assessors wrote very long sentences to incorporate all information, while others gave a brief description.
2. Assessors interpret scenes according to cultural or pop-cultural references, which are not universally recognized.
3. Specifying the time of day was often not possible for indoor videos.
4. Given the removal of videos with multiple disjointed scenes, assessors were better able to provide descriptions.
Sample Captions of 5 Assessors

Video 1:
1. Orange car #1 on gray day drives around curve in road race test.
2. Orange car drives on wet road curve with observers.
3. An orange car with black roof, is driving around a curve on the road, while a person, wearing grey is observing it.
4. The orange car is driving on the road and going around a curve.
5. Advertisement for automobile mountain race showing the orange number one car navigating a curve on the mountain during the race in the evening; an individual is observing the vehicle dressed in jeans and cold weather coat.

Video 2:
1. A woman lets go of a brown ball attached to overhead wire that comes back and hits her in the face.
2. In a room, a bowling ball on a string swings and hits a woman with a white shirt on in the face.
3. During a demonstration a white woman with black hair wearing a white top and holding a ball tethered to a line from above as the demonstrator tells her to let go of the ball which returns on its tether and hits the woman in the face.
4. A man in blue holds a ball on a cord and lets it swing, and it comes back and hits a woman in white in the face.
5. A young girl, before an audience of students, allows a pendulum to swing from her face and all are surprised when it returns to strike her.
2018 Participants (12 teams finished)

Team                | Matching & Ranking (26 runs) | Description Generation (24 runs)
INF                 | ✓ | ✓
KSLAB               | ✓ | ✓
KU_ISPL             | ✓ | ✓
MMSys_CCMIP         | ✓ | ✓
NTU_ROSE            | ✓ | ✓
UTS_CETC_D2DCRC_CAI | ✓ | ✓
PicSOM              |   | ✓
UPCer               |   | ✓
EURECOM             | ✓ |
ORAND               | ✓ |
RUCMM               | ✓ |
UCR_VCG             | ✓ |
Sub-task 1: Matching & Ranking

Example descriptions:
• Person reading newspaper outdoors at daytime
• Three men running in the street at daytime
• Person playing golf outdoors in the field
• Two men looking at laptop in an office

• Up to 4 runs per site were allowed in the Matching & Ranking subtask.
• Mean inverted rank was used for evaluation.
• Five sets of descriptions were used.
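As a minimal sketch of the scoring, "mean inverted rank" is taken here to be the mean reciprocal rank of the correct description in each submitted ranked list; the exact NIST scoring script may differ in details such as tie handling.

```python
# Mean inverted (reciprocal) rank over a set of videos.

def mean_inverted_rank(ranked_lists, ground_truth):
    """ranked_lists: {video_id: [desc_id, ...]} in ranked order.
    ground_truth: {video_id: correct desc_id}.
    A video whose correct description is absent contributes 0."""
    total = 0.0
    for video_id, ranking in ranked_lists.items():
        correct = ground_truth[video_id]
        if correct in ranking:
            total += 1.0 / (ranking.index(correct) + 1)
    return total / len(ranked_lists)
```

So a system that always ranks the correct description first scores 1.0, and each set of descriptions (A through E) is scored separately.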
Matching & Ranking Results – Sets A–E
[Five bar charts, one per description set (A–E): mean inverted rank per team for Runs 1–4; y-axis 0–0.6.]
Systems Rankings for each Set

Rank | A | B | C | D | E
1 | RUCMM | RUCMM | RUCMM | RUCMM | RUCMM
2 | INF | INF | INF | INF | INF
3 | EURECOM | EURECOM | EURECOM | EURECOM | EURECOM
4 | UCR_VCG | UCR_VCG | UCR_VCG | UCR_VCG | UCR_VCG
5 | NTU_ROSE | KU_ISPL | ORAND | KU_ISPL | KU_ISPL
6 | KU_ISPL | ORAND | KU_ISPL | ORAND | ORAND
7 | ORAND | NTU_ROSE | NTU_ROSE | KSLAB | KSLAB
8 | KSLAB | UTS_CETC_D2DCRC_CAI | KSLAB | NTU_ROSE | UTS_CETC_D2DCRC_CAI
9 | UTS_CETC_D2DCRC_CAI | KSLAB | UTS_CETC_D2DCRC_CAI | UTS_CETC_D2DCRC_CAI | NTU_ROSE
10 | MMSys_CCMIP | MMSys_CCMIP | MMSys_CCMIP | MMSys_CCMIP | MMSys_CCMIP
Not much difference between these runs.
Top 3 Results
[Example videos with the highest matching scores: #1874, #1681, #598.]
Bottom 3 Results
[Example videos with the lowest matching scores: #1029, #958, #1825.]
Sub-task 2: Description Generation

Given a video, generate a textual description (e.g. "a dog is licking its nose").

• Up to 4 runs were allowed in the Description Generation subtask.
• Metrics used for evaluation:
  • BLEU (BiLingual Evaluation Understudy)
  • METEOR (Metric for Evaluation of Translation with Explicit ORdering)
  • CIDEr (Consensus-based Image Description Evaluation)
  • STS (Semantic Textual Similarity)
  • DA (Direct Assessment), a crowdsourced rating of captions using Amazon Mechanical Turk (AMT)
• Run types:
  • V (Vine videos used for training)
  • N (only non-Vine videos used for training)
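To give a feel for the automatic metrics in the list above, here is a bare-bones sketch of the simplest one: BLEU restricted to unigrams (BLEU-1), i.e. clipped unigram precision with a brevity penalty. Full evaluations use higher-order n-grams and, in this task, all 5 assessor captions as references.

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """candidate: list of tokens; references: list of token lists."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    # Clip each candidate token count by its maximum count over the references.
    max_ref = Counter()
    for ref in references:
        for tok, n in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty against the reference closest in length.
    ref_len = min((len(r) for r in references), key=lambda L: abs(L - len(candidate)))
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * precision
```

METEOR, CIDEr, and STS each go well beyond this (stemming and synonym matching, TF-IDF-weighted n-gram consensus, and semantic similarity, respectively), which is why the task reports all of them.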
CIDEr Results
[Bar chart: CIDEr score per team, Runs 1–4; y-axis 0–0.45.]

CIDEr-D Results
[Bar chart: CIDEr-D score per team, Runs 1–4; y-axis 0–0.18.]

METEOR Results
[Bar chart: METEOR score per team, Runs 1–4; y-axis 0–0.25.]

BLEU Results
[Bar chart: BLEU score per team, Runs 1–4; y-axis 0–0.03.]

STS Results
[Bar chart: STS score per team, Runs 1–4; y-axis 0–0.5.]

CIDEr Results – Run Type
[Bar chart: CIDEr score by run type, V vs. N; y-axis 0–0.45.]
Direct Assessment (DA)
• Two measures were used:
  • RAW: average DA score [0..100] for each system (non-standardised), micro-averaged per caption and then averaged overall.
  • Z: average DA score per system after standardisation by each individual AMT worker's mean and standard deviation.
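The Z measure above can be sketched as follows: standardise each AMT worker's raw [0..100] scores by that worker's own mean and standard deviation, then average the standardised scores per system. The flat tuple layout of the ratings is an assumption for illustration.

```python
from statistics import mean, pstdev

def z_scores(ratings):
    """ratings: list of (worker_id, system_id, raw_score) tuples.
    Returns {system_id: mean standardised score}."""
    # Per-worker mean and std. dev. (fall back to 1.0 when a worker is constant).
    by_worker = {}
    for worker, _, score in ratings:
        by_worker.setdefault(worker, []).append(score)
    stats = {w: (mean(s), pstdev(s) or 1.0) for w, s in by_worker.items()}
    # Standardise each rating by its worker's stats, then average per system.
    by_system = {}
    for worker, system, score in ratings:
        mu, sd = stats[worker]
        by_system.setdefault(system, []).append((score - mu) / sd)
    return {system: mean(z) for system, z in by_system.items()}
```

This removes per-worker rating biases (some workers rate everything high, others low), which is why the Z ranking can differ from the RAW one.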
DA Results – Raw
[Bar chart: raw DA score per system; y-axis 0–100.]

DA Results – Z
[Bar chart: standardised (Z) DA score per system; y-axis -0.8 to 1.0.]
What the DA Results Tell Us
[Significance matrix: rows and columns are the systems and human annotators; green squares mark significant differences.]
1. Green squares indicate a significant "win" for the row over the column.
2. No system yet reaches human performance.
3. Humans B and E perform statistically better than Human D.
4. Amongst systems, INF outperforms the rest.
Systems Rankings for each Metric

Rank | CIDEr | CIDEr-D | METEOR | BLEU | STS | DA
1 | INF | INF | INF | INF | INF | INF
2 | UTS_CETC_D2DCRC_CAI | UTS_CETC_D2DCRC_CAI | UTS_CETC_D2DCRC_CAI | UTS_CETC_D2DCRC_CAI | UTS_CETC_D2DCRC_CAI | UTS_CETC_D2DCRC_CAI
3 | NTU_ROSE | UPCer | UPCer | UPCer | PicSOM | UPCer
4 | PicSOM | KSLAB | PicSOM | PicSOM | NTU_ROSE | PicSOM
5 | UPCer | PicSOM | KU_ISPL | KSLAB | UPCer | KU_ISPL
6 | KSLAB | NTU_ROSE | KSLAB | KU_ISPL | KU_ISPL | KSLAB
7 | KU_ISPL | KU_ISPL | NTU_ROSE | NTU_ROSE | KSLAB | NTU_ROSE
8 | MMSys_CCMIP | MMSys_CCMIP | MMSys_CCMIP | MMSys_CCMIP | MMSys_CCMIP | MMSys_CCMIP
Observations
• The task continues to evolve: the number of annotations per video was standardized to 5 (compared to last year's task).
• We tried to remove redundancy and create a diverse set with little or no ambiguity for the matching sub-task.
• Steps were taken to ensure a cleaner dataset was used for the task.
Participants
• Teams presenting today:
  • RUCMM
  • KU_ISPL
  • INF
• Very high-level summaries of the other teams' approaches follow.
UTS_CETC_D2DCRC
• Widely used LSTM-based sequence-to-sequence model.
• Focus on improving the generalization ability of the model.
• Different training strategies were used.
• Several combinations of spatial and temporal features were ensembled together.
• A simple model structure was preferred to help generalization.
• Training data: MSVD, MSR-VTT 2016, TGIF, VTT 2016, VTT 2017.
PicSOM
Description Generation
• LSTM recurrent neural networks were used to generate descriptions from multi-modal features.
• Visual features include image, video, and trajectory features.
• Audio features were also used.
• Training datasets: MS COCO, MSR-VTT, TGIF, MSVD.
• Significant improvement from expanding the MSR-VTT training data with MS COCO.
KSLAB
• The main idea is to extract representations from key frames only.
• Key frames are detected for different types of events.
• The method uses a CNN encoder and an LSTM decoder.
• The model was trained on the MS COCO dataset.
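The key-frame idea above can be sketched with a toy selector; KSLAB's actual detector is event-based, whereas this simplified version just keeps a frame when its feature vector has drifted far enough from the last kept frame.

```python
# Toy key-frame selection by feature drift (L1 distance to the last kept frame).

def select_keyframes(frames, threshold=0.5):
    """frames: list of per-frame feature vectors (lists of floats).
    Returns the indices of the selected key frames."""
    if not frames:
        return []
    keyframes = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        last = frames[keyframes[-1]]
        dist = sum(abs(a - b) for a, b in zip(frames[i], last))
        if dist > threshold:
            keyframes.append(i)
    return keyframes
```

Only the selected frames would then be fed to the CNN encoder, keeping the LSTM decoder's input short for 6-second clips.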
NTU_ROSE
• Matching & Ranking
  • Trained 2 different models on the MS COCO dataset.
  • Image-based retrieval methods were found suitable.
• Description Generation
  • Training datasets: MSR-VTT and MSVD.
  • CST-captioning (Consensus-based Sequence Training) was used as the baseline and adapted.
  • Both visual and audio features were used.
  • The model trained on MSR-VTT performed better, probably because it generates longer sentences than the one trained on MSVD.
MMSys
• Matching & Ranking
  • Wikipedia and Pascal Sentence datasets were used for training.
  • A pre-trained cross-modal retrieval method was used for the matching task.
• Description Generation
  • The MSR-VTT dataset was used for training.
  • Frames were extracted at 1 fps and a pre-trained Inception-ResNet-v2 was used to extract features.
  • sen2vec was used for text features.
  • The model was trained on frame and text features.
EURECOM
Matching & Ranking
• Improved on the approach of the best team of 2017 (DL-61-86).
• Feature vectors were derived from frames extracted at 2 fps using the final layer of ResNet-152.
• Contextualized features were obtained and combined through a soft attention mechanism.
• The resulting vector v is fed into two fully connected layers with ReLU activation.
• Vector v is concatenated with a vector from the last layer of an RGB-I3D.
• Instead of using ResNet-152 trained on ImageNet only, it is also fine-tuned on MS COCO.
UCR_VCG
Matching & Ranking
• The MS COCO dataset was used for training.
• Keyframes (representative frames) were extracted from videos.
• A joint image-text embedding approach was used to match videos to descriptions.
Conclusion
• Good level of participation; the task will be renewed.
• This year we had more annotations per video.
• A cleaner dataset was created.
• Direct Assessment was used for the second year running. This year we included multiple human responses, and the results are interesting.
• Lots of training sets are available, with some overlap: MSR-VTT, MS COCO, ImageNet, YouTube2Text, MSVD, TRECVID 2016–2017 VTT, TGIF.
• Some teams used audio features in addition to visual features.
Discussion
• Is there value in the caption ranking sub-task? Should it be continued, especially with some teams participating only in this subtask?
• Is the inclusion of run type (N or V) valuable?
  • Other possible run types? Video datasets only vs. video + image captioning training datasets.
• Possibilities for a new dataset?
• Are more teams planning to use audio features? What about motion from video?
• What did individual teams learn?
Boost Fertility New Invention Ups Success Rates.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

[TRECVID 2018] Video to Text

• 1. TRECVID 2018 Video to Text Description. Asad A. Butt (NIST), George Awad (NIST; Dakota Consulting, Inc.), Alan Smeaton (Dublin City University). TRECVID 2018. Disclaimer: Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.
• 2. Goals and Motivations: ✓ Measure how well an automatic system can describe a video in natural language. ✓ Measure how well an automatic system can match high-level textual descriptions to low-level computer vision features. ✓ Transfer successful image captioning technology to the video domain. Real-world applications: ✓ Video summarization ✓ Supporting search and browsing ✓ Accessibility (video description for the blind) ✓ Video event prediction.
• 3. Tasks: Systems are asked to submit results for two subtasks: 1. Matching & Ranking: return for each video URL a ranked list of the most likely text descriptions from each of the five sets. 2. Description Generation: automatically generate a text description for each video URL.
• 4. Video Dataset: Crawled 50k+ Twitter Vine video URLs; maximum video duration is 6 seconds. A subset of 2000 URLs was (quasi-)randomly selected and divided amongst 10 assessors. Significant preprocessing was done to remove unsuitable videos. The final dataset included 1903 URLs, after removal of videos taken down from Vine.
• 5. Steps to Remove Redundancy: Before selecting the dataset, we clustered videos based on visual similarity, using a tool called SOTU [1], which uses a visual bag of words to cluster videos that share at least 60% similarity across at least 3 frames. This removed duplicate videos, as well as videos that were very visually similar (e.g. soccer games), resulting in a more diverse set of videos. [1] Zhao, Wan-Lei, and Chong-Wah Ngo. "SOTU in Action." (2012).
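The thresholding idea behind this redundancy step (pairs of frames count as a match above a similarity cutoff; videos with enough matching frames are flagged as near-duplicates) can be sketched in a few lines. This is a toy illustration, not the SOTU implementation: frames are represented as sets of visual-word IDs and compared with Jaccard overlap, with the 60%/3-frame parameters taken from the slide above.

```python
def frames_match(words_a, words_b, threshold=0.6):
    """Two frames 'match' if the Jaccard overlap of their
    visual-word sets reaches the similarity threshold."""
    inter = len(words_a & words_b)
    union = len(words_a | words_b)
    return union > 0 and inter / union >= threshold

def videos_near_duplicate(video_a, video_b, min_frames=3):
    """Flag two videos (lists of per-frame visual-word sets) as
    near-duplicates if at least `min_frames` frames of one have a
    matching frame somewhere in the other."""
    matches = sum(1 for fa in video_a
                  if any(frames_match(fa, fb) for fb in video_b))
    return matches >= min_frames

# Two copies of the same clip are flagged; unrelated clips are not.
clip = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({5, 6, 7})]
other = [frozenset({10, 11}), frozenset({12, 13}), frozenset({14, 15})]
print(videos_near_duplicate(clip, clip), videos_near_duplicate(clip, other))  # True False
```

In practice the visual words would come from quantized local descriptors (as in a bag-of-visual-words pipeline); the pairwise flags would then feed a clustering step so that only one representative per cluster is kept.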
• 6. Dataset Cleaning: The dataset creation process involved manually going through a large collection of videos. A list of commonly appearing videos from last year was used to select a diverse set. Videos with multiple, unrelated segments that are hard to describe were removed, as were animated (or otherwise unsuitable) videos. This resulted in a much cleaner dataset.
• 7. Annotation Process: Each video was annotated by 5 assessors. Following annotation guidelines by NIST, annotators were asked, for each video, to combine four facets where applicable: Who is shown (objects, persons, animals, etc.)? What are the objects and beings doing (actions, states, events, etc.)? Where is it taking place (locale, site, place, geographic region, etc.)? When is it taking place (time of day, season, etc.)?
• 8. Annotation Process – Observations: 1. Different assessors provide varying amounts of detail when describing videos: some wrote very long sentences to incorporate all the information, while others gave brief descriptions. 2. Assessors interpret scenes according to cultural or pop-culture references that are not universally recognized. 3. Specifying the time of day was often not possible for indoor videos. 4. Given the removal of videos with multiple disjointed scenes, assessors were better able to provide descriptions.
• 9. Sample Captions of 5 Assessors. Video 1: (1) Orange car #1 on gray day drives around curve in road race test. (2) Orange car drives on wet road curve with observers. (3) An orange car with black roof, is driving around a curve on the road, while a person, wearing grey is observing it. (4) The orange car is driving on the road and going around a curve. (5) Advertisement for automobile mountain race showing the orange number one car navigating a curve on the mountain during the race in the evening; an individual is observing the vehicle dressed in jeans and cold weather coat. Video 2: (1) A woman lets go of a brown ball attached to overhead wire that comes back and hits her in the face. (2) In a room, a bowling ball on a string swings and hits a woman with a white shirt on in the face. (3) During a demonstration a white woman with black hair wearing a white top and holding a ball tethered to a line from above as the demonstrator tells her to let go of the ball which returns on its tether and hits the woman in the face. (4) A man in blue holds a ball on a cord and lets it swing, and it comes back and hits a woman in white in the face. (5) A young girl, before an audience of students, allows a pendulum to swing from her face and all are surprised when it returns to strike her.
• 10. 2018 Participants (12 teams finished). Matching & Ranking (26 runs): INF, KSLAB, KU_ISPL, MMSys_CCMIP, NTU_ROSE, UTS_CETC_D2DCRC_CAI, EURECOM, ORAND, RUCMM, UCR_VCG. Description Generation (24 runs): INF, KSLAB, KU_ISPL, MMSys_CCMIP, NTU_ROSE, PicSOM, UPCer, UTS_CETC_D2DCRC_CAI.
• 11. Sub-task 1: Matching & Ranking. Up to 4 runs per site were allowed in the Matching & Ranking subtask. Mean inverted rank was used for evaluation, over five sets of descriptions. Example descriptions: "Person reading newspaper outdoors at daytime"; "Three men running in the street at daytime"; "Person playing golf outdoors in the field"; "Two men looking at laptop in an office".
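As a reference point for the result charts that follow, mean inverted rank rewards systems that place the correct description near the top of their ranked list. A minimal sketch of the metric (not NIST's official scorer):

```python
def mean_inverted_rank(ranks):
    """ranks: the 1-based position of the correct description in each
    video's ranked list. A perfect system (all rank 1) scores 1.0;
    lower-ranked correct answers pull the mean down quickly."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Correct caption ranked 1st, 2nd, and 4th over three videos:
print(mean_inverted_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 = 0.5833...
```

The steep 1/rank discounting explains why the per-set scores plotted on the next slides stay well below 1.0 even for the strongest runs.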
• 12. Matching & Ranking Results – Set A [bar chart of mean inverted rank per run (Runs 1–4), scale 0–0.6]
• 13. Matching & Ranking Results – Set B [bar chart of mean inverted rank per run (Runs 1–4), scale 0–0.6]
• 14. Matching & Ranking Results – Set C [bar chart of mean inverted rank per run (Runs 1–4), scale 0–0.6]
• 15. Matching & Ranking Results – Set D [bar chart of mean inverted rank per run (Runs 1–4), scale 0–0.6]
• 16. Matching & Ranking Results – Set E [bar chart of mean inverted rank per run (Runs 1–4), scale 0–0.6]
• 17. Systems Rankings for each Set. Set A: RUCMM, INF, EURECOM, UCR_VCG, NTU_ROSE, KU_ISPL, ORAND, KSLAB, UTS_CETC_D2DCRC_CAI, MMSys_CCMIP. Set B: RUCMM, INF, EURECOM, UCR_VCG, KU_ISPL, ORAND, NTU_ROSE, UTS_CETC_D2DCRC_CAI, KSLAB, MMSys_CCMIP. Set C: RUCMM, INF, EURECOM, UCR_VCG, ORAND, KU_ISPL, NTU_ROSE, KSLAB, UTS_CETC_D2DCRC_CAI, MMSys_CCMIP. Set D: RUCMM, INF, EURECOM, UCR_VCG, KU_ISPL, ORAND, KSLAB, NTU_ROSE, UTS_CETC_D2DCRC_CAI, MMSys_CCMIP. Set E: RUCMM, INF, EURECOM, UCR_VCG, KU_ISPL, ORAND, KSLAB, UTS_CETC_D2DCRC_CAI, NTU_ROSE, MMSys_CCMIP. (Note: there is not much difference between some of these mid-ranked runs.)
• 18. Top 3 Results [example videos #1874, #1681, #598]
• 19. Bottom 3 Results [example videos #1029, #958, #1825]
• 20. Sub-task 2: Description Generation. Given a video, generate a textual description (e.g. "a dog is licking its nose") covering who, what, where, and when. Up to 4 runs were allowed in the Description Generation subtask. Metrics used for evaluation: BLEU (BiLingual Evaluation Understudy), METEOR (Metric for Evaluation of Translation with Explicit Ordering), CIDEr (Consensus-based Image Description Evaluation), STS (Semantic Textual Similarity), and DA (Direct Assessment), a crowdsourced rating of captions using Amazon Mechanical Turk (AMT). Run types: V (Vine videos used for training) and N (only non-Vine videos used for training).
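To give a flavour of how these overlap-based metrics compare a generated caption against the assessors' ground-truth captions, here is the clipped n-gram precision at the core of BLEU, restricted to unigrams for brevity. This is an illustrative sketch only: full BLEU combines n = 1..4 precisions and adds a brevity penalty, while METEOR and CIDEr weight matches differently.

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Fraction of candidate tokens matched in the references, with each
    token's count clipped by the maximum count seen in any single
    reference (so repeating a common word cannot inflate the score)."""
    cand = Counter(candidate)
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(count, max_ref[word]) for word, count in cand.items())
    return clipped / max(1, sum(cand.values()))

cand = "a dog licks its nose".split()
refs = ["a dog is licking its nose".split(),
        "the dog licks his nose".split()]
print(clipped_unigram_precision(cand, refs))  # all 5 tokens matched -> 1.0
```

Having five reference captions per video (one per assessor) helps exactly here: a candidate only needs to match the wording of *some* reference, which softens the penalty for legitimate paraphrases.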
• 21. CIDEr Results [bar chart per run (Runs 1–4), scale 0–0.45]
• 22. CIDEr-D Results [bar chart per run (Runs 1–4), scale 0–0.18]
• 23. METEOR Results [bar chart per run (Runs 1–4), scale 0–0.25]
• 24. BLEU Results [bar chart per run (Runs 1–4), scale 0–0.03]
• 25. STS Results [bar chart per run (Runs 1–4), scale 0–0.5]
• 26. CIDEr Results – Run Type [bar chart comparing V and N run types, scale 0–0.45]
• 27. Direct Assessment (DA) measures: RAW – the average DA score [0..100] for each system (non-standardised), micro-averaged per caption and then averaged overall. Z – the average DA score per system after standardisation against each individual AMT worker's mean and standard deviation.
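The Z computation can be sketched as follows. This is a minimal illustration of per-worker standardisation, assuming a flat list of (worker, system, score) ratings rather than the actual AMT export format; z-scoring each rating against that worker's own mean and standard deviation removes per-worker rating-scale bias (generous vs. harsh raters) before systems are compared.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardise_per_worker(ratings):
    """ratings: list of (worker, system, raw_score in 0..100).
    Returns {system: average z-scored rating}."""
    by_worker = defaultdict(list)
    for worker, _, score in ratings:
        by_worker[worker].append(score)
    # Each worker's own mean/std; guard against zero spread.
    stats = {w: (mean(v), pstdev(v) or 1.0) for w, v in by_worker.items()}
    by_system = defaultdict(list)
    for worker, system, score in ratings:
        mu, sd = stats[worker]
        by_system[system].append((score - mu) / sd)
    return {system: mean(v) for system, v in by_system.items()}

# Worker w1 rates generously, w2 harshly; after standardisation both
# agree that sysA beats sysB by the same margin.
ratings = [("w1", "sysA", 90), ("w1", "sysB", 80),
           ("w2", "sysA", 30), ("w2", "sysB", 20)]
print(standardise_per_worker(ratings))  # {'sysA': 1.0, 'sysB': -1.0}
```

This is why the Raw and Z charts on the following slides can order systems slightly differently: Z discounts systems whose captions happened to be judged by lenient workers.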
• 28. DA Results – Raw [bar chart of raw DA scores per system, scale 0–100]
• 29. DA Results – Z [bar chart of standardised DA scores per system, scale -0.8–1.0]
• 30. What the DA Results Tell Us: 1. Green squares indicate a significant "win" for the row over the column. 2. No system yet reaches human performance. 3. Humans B and E perform statistically better than Human D. 4. Amongst systems, INF outperforms the rest.
• 31. Systems Rankings for each Metric. CIDEr: INF, UTS_CETC_D2DCRC_CAI, NTU_ROSE, PicSOM, UPCer, KSLAB, KU_ISPL, MMSys_CCMIP. CIDEr-D: INF, UTS_CETC_D2DCRC_CAI, UPCer, KSLAB, PicSOM, NTU_ROSE, KU_ISPL, MMSys_CCMIP. METEOR: INF, UTS_CETC_D2DCRC_CAI, UPCer, PicSOM, KU_ISPL, KSLAB, NTU_ROSE, MMSys_CCMIP. BLEU: INF, UTS_CETC_D2DCRC_CAI, UPCer, PicSOM, KSLAB, KU_ISPL, NTU_ROSE, MMSys_CCMIP. STS: INF, UTS_CETC_D2DCRC_CAI, PicSOM, NTU_ROSE, UPCer, KU_ISPL, KSLAB, MMSys_CCMIP. DA: INF, UTS_CETC_D2DCRC_CAI, UPCer, PicSOM, KU_ISPL, KSLAB, NTU_ROSE, MMSys_CCMIP.
• 32. Observations: The task continues to evolve; the number of annotations per video was standardized to 5 (compared to last year's task). We tried to remove redundancy and create a diverse set with little or no ambiguity for the matching subtask. Steps were taken to ensure that a cleaner dataset was used for the task.
• 33. Participants: Teams presenting today: RUCMM, KU_ISPL, INF. Very high-level bullets on the approaches of the other teams follow.
• 34. UTS_CETC_D2DCRC: A widely used LSTM-based sequence-to-sequence model, focused on improving the model's generalization ability. Different training strategies were used, and several combinations of spatial and temporal features were ensembled together. A simple model structure was preferred to aid generalization. Training data: MSVD, MSR-VTT 2016, TGIF, VTT 2016, VTT 2017.
• 35. PicSOM Description Generation: LSTM recurrent neural networks were used to generate descriptions from multi-modal features. Visual features included image, video, and trajectory features; audio features were also used. Training datasets: MS COCO, MSR-VTT, TGIF, MSVD. A significant improvement came from expanding the MSR-VTT training dataset with MS COCO.
• 36. KSLAB: The main idea is to extract representations from key frames only; key frames are detected for different types of events. The method uses a CNN encoder and an LSTM decoder, with the model trained on the MS COCO dataset.
• 37. NTU_ROSE: Matching & Ranking: trained 2 different models on the MS COCO dataset; image-based retrieval methods were found suitable. Description Generation: training datasets were MSR-VTT and MSVD; CST-captioning (Consensus-based Sequence Training) was used as the baseline and adapted, with both visual and audio features. The model trained on MSR-VTT performed better, probably because it generates longer sentences than the one trained on MSVD.
• 38. MMSys: Matching & Ranking: the Wikipedia and Pascal Sentence datasets were used for training, with a pre-trained cross-modal retrieval method for the matching task. Description Generation: MSR-VTT was used for training; frames were extracted at 1 fps per video and a pre-trained Inception-ResNet-v2 used to extract visual features, with sen2vec used for text features. The model was trained on the frame and text features.
• 39. EURECOM Matching & Ranking: Improved the approach of the best team of 2017 (DL-61-86). Feature vectors are derived from frames extracted at 2 fps using the final layer of ResNet-152. Contextualized features are obtained and combined through a soft attention mechanism; the resulting vector v is fed into two fully connected layers with ReLU activations, and v is concatenated with the vector from the last layer of an RGB-I3D network. Instead of using a ResNet-152 trained only on ImageNet, it is also fine-tuned on MS COCO.
• 40. UCR_VCG Matching & Ranking: The MS COCO dataset was used for training. Representative keyframes were extracted from the videos, and a joint image-text embedding approach was used to match videos to descriptions.
• 41. Conclusion: A good level of participation; the task will be renewed. This year we had more annotations per video, and a cleaner dataset was created. Direct Assessment was used for a second year running, this year including multiple human responses; the results are interesting. Many training sets are available, with some overlap: MSR-VTT, MS COCO, ImageNet, YouTube2Text, MSVD, TRECVID 2016-2017 VTT, TGIF. Some teams used audio features in addition to visual features.
• 42. Discussion: Is there value in the caption ranking subtask? Should it be continued, especially with some teams participating only in this subtask? Is the inclusion of run type (N or V) valuable? Are there other possible run types, e.g. video-only vs. video + image captioning training datasets? Are there possibilities for a new dataset? Are more teams planning to use audio features? What about motion from video? What did individual teams learn?