CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking

•

1 like•253 views

In this report, we present our experiments performed for the Hyperlinking part of the Search and Hyperlinking Task in MediaEval Benchmark 2014. Our system successfully combines features from multiple modalities (textual, visual, and prosodic) and confirms the positive effect of our former method for segmentation based on Decision Trees. ceur-ws.org/Vol-1263/mediaeval2014_submission_21.pdf

CUNI at MediaEval 2014 Search and Hyperlinking Task:
Visual and Prosodic Features in Hyperlinking
Petra Galuščáková, Martin Kruliš, Jakub Lokoč and Pavel Pecina
Charles University in Prague, Faculty of Mathematics and Physics
• Find segments similar to a given (query) segment
in the collection of audio-visual recordings (TV
programmes provided by BBC).
• Available metadata, subtitles, and automatic
transcripts (LIMSI, LIUM, and NST-Sheffield),
visual features (shots and keyframes), and
prosodic features.
1. Search System
• The search system used for Hyperlinking is
identical to the system used in the Search sub-task.
• Segmentation: we used segments of fixed length
and our segmentation system which based on
Decision Trees (DT).
• The calculated VisualSimilarity was used to modify the
final score of the segment in the retrieval as follows:
4. Prosodic Similarity
• We took sequences of vectors of prosodic
features appearing up to 1 second from the
beginning of the query segment and found the
most similar sequence of the vectors in each
segment.
Transcripts Segmentation Weightening Metadata Overlap MAP P 5 P 10 P 20 MAP-bin MAP-tol
Subtitles Fixed None No No 0.1618 0.4786 0.4107 0.2893 0.1423 0.1216
Subtitles Fixed Visual No No 0.1660 0.4929 0.4143 0.3000 0.1483 0.1245
Subtitles Fixed None Yes No 0.4301 0.8600 0.7767 0.5483 0.2689 0.2465
Subtitles Fixed Visual Yes Yes 4.1824 0.9667 0.9567 0.8967 0.3080 0.0996
Subtitles Fixed Visual Yes No 0.4366 0.8667 0.7700 0.5633 0.2724 0.2580
Subtitles Fixed Prosodic Yes No 0.4321 0.8533 0.7767 0.5517 0.2687 0.2473
Subtitles DT Visual Yes No 0.8253 0.8867 0.8567 0.7383 0.2525 0.1991
LIMSI DT Visual Yes No 0.5196 0.8333 0.7233 0.5817 0.2681 0.1976
LIMSI Fixed Visual Yes Yes 4.0331 0.9400 0.9233 0.8983 0.3042 0.0950
LIUM DT Visual Yes No 0.5195 0.8200 0.7467 0.5733 0.2674 0.2134
LIUM Fixed Visual Yes Yes 3.8916 0.9267 0.8800 0.8800 0.2880 0.0993
NST-Sheffield DT Visual Yes No 0.6889 0.8400 0.8067 0.6500 0.2666 0.1955
NST-Sheffield Fixed Visual Yes Yes 3.9822 0.9267 0.9267 0.8817 0.2949 0.0914
Tab1. Results of the Hyperlinking sub-task for different transcripts, segmentation types,
weightening types, metadata, and removal of overlapping retrieved segments.
3. Visual Similarity
• We calculated VisualSimilarity between each
segment and each query segment using the
keyframes.
• We used the Signature Quadratic Form Distance
and Feature Signatures.
• The fixed-length segmentation outperforms the
DT-based segmentation with 120-seconds-long
segments
• 50-seconds-long DT-segments outperform the
fixed-length segments measured by MAP and
precision-based measures, but are outperformed by
fixed-length segments using the MAP-bin and
MAP-tol measures
• There is a constant improvement obtained by using
the visual weights.
• There is small but promising improvement in
MAP, P20, and MAP-tol measures by using the
prosodic features.
• The concatenation with the context and metadata
is proved to be beneficial.
6. Conclusion
• We successfully combine features from multiple
modalities (textual, visual, and prosodic).
• The segmentation employing the Decision Trees
outperforms the fixed-length segmentation.
Acknowledgments
This research is supported by the Czech Science Foundation, grant
number P103/12/G084, Charles University Grant Agency GA UK, grant
number 920913, and by SVV project number 260 104.
2. Hyperlinking
• We first transformed the query segment into a
textual query by including all the words of the
subtitles lying within the segment boundary.
• We extended the segment boundary by including
the context surrounding the query segment.
• We combined textual and visual/prosodic
similarity.
5. Results
http://siret.ms.mff.cuni.cz

Similar to CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking

Deep learning-based switchable network for in-loop filtering in high efficie...IJECEIAES

Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Ijripublishers Ijri

RIPE 78: A review of the KSK RollAPNIC

Novel Approach for Evaluating Video Transmission using Combined Scalable Vide...IJECEIAES

Search and Hyperlinking Overview @MediaEval2014Maria Eskevich

Fast channel zapping with destination oriented multicast for ip video deliveryecway

Java fast channel zapping with destination-oriented multicast for ip video d...ecwayerode

Internship PresentationAlejandro_Montoya

2016-02-17 research seminarifi8106tlu

How much position information do convolutional neural networks encode? review...Dongmin Choi

[2018 台灣人工智慧學校校友年會] 視訊畫面生成 / 林彥宇台灣資料科學年會

Research and activity reportMarco Cagnazzo

New Microsoft PowerPoint Presentation.pptxAbdulRehman743834

scopus indexed journals listrikaseorika

published journalsrikaseorika

Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...Vignesh V Menon

Mobile Visual Search: Object Re-Identification Against Large RepositoriesUnited States Air Force Academy

Machine Learning approaches at video compression Roberto Iacoviello

Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Ijripublishers Ijri

CNVMiner: Pipeline to Mine CNV & Structural Variation in Hierarchical FashionInterpretOmics

Similar to CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking (20)

Deep learning-based switchable network for in-loop filtering in high efficie...

Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...

RIPE 78: A review of the KSK Roll

Novel Approach for Evaluating Video Transmission using Combined Scalable Vide...

Search and Hyperlinking Overview @MediaEval2014

Fast channel zapping with destination oriented multicast for ip video delivery

Java fast channel zapping with destination-oriented multicast for ip video d...

Internship Presentation

2016-02-17 research seminar

How much position information do convolutional neural networks encode? review...

[2018 台灣人工智慧學校校友年會] 視訊畫面生成 / 林彥宇

Research and activity report

New Microsoft PowerPoint Presentation.pptx

scopus indexed journals list

published journals

Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...

Mobile Visual Search: Object Re-Identification Against Large Repositories

Machine Learning approaches at video compression

Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...

CNVMiner: Pipeline to Mine CNV & Structural Variation in Hierarchical Fashion

More from multimediaeval

Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal...multimediaeval

HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table...multimediaeval

Sports Video Classification: Classification of Strokes in Table Tennis for Me...multimediaeval

Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention...multimediaeval

Essex-NLIP at MediaEval Predicting Media Memorability 2020 Taskmultimediaeval

Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V...multimediaeval

Fooling an Automatic Image Quality Estimatormultimediaeval

Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C...multimediaeval

Pixel Privacy: Quality Camouflage for Social Imagesmultimediaeval

HCMUS at MediaEval 2020:Image-Text Fusion for Automatic News-Images Re-Matchingmultimediaeval

Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...multimediaeval

HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ...multimediaeval

Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int...multimediaeval

Deep Conditional Adversarial learning for polyp Segmentationmultimediaeval

A Temporal-Spatial Attention Model for Medical Image Detectionmultimediaeval

HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...multimediaeval

Fine-tuning for Polyp Segmentation with Attentionmultimediaeval

Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...multimediaeval

Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...multimediaeval

Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ...multimediaeval

More from multimediaeval (20)

Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal...

HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table...

Sports Video Classification: Classification of Strokes in Table Tennis for Me...

Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention...

Essex-NLIP at MediaEval Predicting Media Memorability 2020 Task

Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V...

Fooling an Automatic Image Quality Estimator

Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C...

Pixel Privacy: Quality Camouflage for Social Images

HCMUS at MediaEval 2020:Image-Text Fusion for Automatic News-Images Re-Matching

Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...

HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ...

Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int...

Deep Conditional Adversarial learning for polyp Segmentation

A Temporal-Spatial Attention Model for Medical Image Detection

HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...

Fine-tuning for Polyp Segmentation with Attention

Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...

Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...

Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ...

CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking

1. CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking Petra Galuščáková, Martin Kruliš, Jakub Lokoč and Pavel Pecina Charles University in Prague, Faculty of Mathematics and Physics • Find segments similar to a given (query) segment in the collection of audio-visual recordings (TV programmes provided by BBC). • Available metadata, subtitles, and automatic transcripts (LIMSI, LIUM, and NST-Sheffield), visual features (shots and keyframes), and prosodic features. 1. Search System • The search system used for Hyperlinking is identical to the system used in the Search sub-task. • Segmentation: we used segments of fixed length and our segmentation system which based on Decision Trees (DT). • The calculated VisualSimilarity was used to modify the final score of the segment in the retrieval as follows: 4. Prosodic Similarity • We took sequences of vectors of prosodic features appearing up to 1 second from the beginning of the query segment and found the most similar sequence of the vectors in each segment. Transcripts Segmentation Weightening Metadata Overlap MAP P 5 P 10 P 20 MAP-bin MAP-tol Subtitles Fixed None No No 0.1618 0.4786 0.4107 0.2893 0.1423 0.1216 Subtitles Fixed Visual No No 0.1660 0.4929 0.4143 0.3000 0.1483 0.1245 Subtitles Fixed None Yes No 0.4301 0.8600 0.7767 0.5483 0.2689 0.2465 Subtitles Fixed Visual Yes Yes 4.1824 0.9667 0.9567 0.8967 0.3080 0.0996 Subtitles Fixed Visual Yes No 0.4366 0.8667 0.7700 0.5633 0.2724 0.2580 Subtitles Fixed Prosodic Yes No 0.4321 0.8533 0.7767 0.5517 0.2687 0.2473 Subtitles DT Visual Yes No 0.8253 0.8867 0.8567 0.7383 0.2525 0.1991 LIMSI DT Visual Yes No 0.5196 0.8333 0.7233 0.5817 0.2681 0.1976 LIMSI Fixed Visual Yes Yes 4.0331 0.9400 0.9233 0.8983 0.3042 0.0950 LIUM DT Visual Yes No 0.5195 0.8200 0.7467 0.5733 0.2674 0.2134 LIUM Fixed Visual Yes Yes 3.8916 0.9267 0.8800 0.8800 0.2880 0.0993 NST-Sheffield DT Visual Yes No 0.6889 0.8400 0.8067 0.6500 0.2666 0.1955 NST-Sheffield Fixed Visual Yes Yes 3.9822 0.9267 0.9267 0.8817 0.2949 0.0914 Tab1. Results of the Hyperlinking sub-task for different transcripts, segmentation types, weightening types, metadata, and removal of overlapping retrieved segments. 3. Visual Similarity • We calculated VisualSimilarity between each segment and each query segment using the keyframes. • We used the Signature Quadratic Form Distance and Feature Signatures. • The fixed-length segmentation outperforms the DT-based segmentation with 120-seconds-long segments • 50-seconds-long DT-segments outperform the fixed-length segments measured by MAP and precision-based measures, but are outperformed by fixed-length segments using the MAP-bin and MAP-tol measures • There is a constant improvement obtained by using the visual weights. • There is small but promising improvement in MAP, P20, and MAP-tol measures by using the prosodic features. • The concatenation with the context and metadata is proved to be beneficial. 6. Conclusion • We successfully combine features from multiple modalities (textual, visual, and prosodic). • The segmentation employing the Decision Trees outperforms the fixed-length segmentation. Acknowledgments This research is supported by the Czech Science Foundation, grant number P103/12/G084, Charles University Grant Agency GA UK, grant number 920913, and by SVV project number 260 104. 2. Hyperlinking • We first transformed the query segment into a textual query by including all the words of the subtitles lying within the segment boundary. • We extended the segment boundary by including the context surrounding the query segment. • We combined textual and visual/prosodic similarity. 5. Results http://siret.ms.mff.cuni.cz

CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking

Recommended

Recommended

More Related Content

Similar to CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking

Similar to CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking (20)

More from multimediaeval

More from multimediaeval (20)

CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking