
Multimodal Features for Search and Hyperlinking of Video Content


In the talk, I will discuss content-based retrieval in audio-visual collections. I will focus on retrieval of relevant segments of video using a textual query. In addition, I will describe techniques for detecting hyperlinks within audio-visual collections. Our retrieval system ranked first in the MediaEval 2014 Search and Hyperlinking shared task. The experiments were performed on almost 4000 hours of BBC broadcast video.

As segmentation of the recordings proves to be crucial for high-quality video retrieval and hyperlinking, I will focus on segmentation strategies. I will show how prosodic and visual information can be incorporated into the segmentation process. Our decision-tree-based segmentation outperforms fixed-length segmentation, which otherwise regularly achieves the best retrieval results. We also explore visual and prosodic similarity in addition to hyperlinking based on the subtitles and automatic transcripts. Using visual similarity yields a consistent improvement, while using prosodic similarity shows a small but promising improvement as well.



  1. Multimodal Features for Search and Hyperlinking of Video Content Petra Galuščáková galuscakova@ufal.mff.cuni.cz Institute of Formal and Applied Linguistics Charles University in Prague 29. 10. 2014
  2. Outline ● Speech Retrieval and Hyperlinking ● Data and Evaluation ● System Description ● Passage Retrieval, Segmentation of Recordings ● Visual and Prosodic Information
  3. Speech Retrieval and Hyperlinking
  4. Search in Audio-Visual Documents ● Input: ● Data collection (video recordings) ● Query – Given as text ● Output: ● Relevant segments (passages) of documents ● E.g. “Children out on poetry trip Exploration of poetry by school children Poem writing”, “Space-Cowboys Space Pirates Pirates in Space talking music”, “animal park, kenya marathon, wildlife reserve”
  5. Hyperlinking ● Input: ● Data collection (video recordings) ● Query segment ● Output: ● Segments similar to the query segment
  6. Data and Evaluation
  7. MediaEval ● MediaEval is a benchmarking initiative dedicated to development, comparison, and improvement of strategies for processing and retrieving multimedia content. ● E.g. speech recognition, multimedia content analysis, music and audio analysis, user-contributed information (tags, tweets), viewer affective response, social networks, temporal and geo-coordinates ● 2012 MediaEval Search and Hyperlinking Task ● 2013 MediaEval Search and Hyperlinking Task ● 2013 Similar Segments in Social Speech Task ● 2014 MediaEval Search and Hyperlinking Task
  8. Search and Hyperlinking Task ● The main goal of the Search Subtask: ● Find passages relevant to a user’s interest given by a textual query in a large set of audio-visual recordings ● And of the Hyperlinking Subtask: ● Find more passages similar to the retrieved ones ● Scenario: ● A user wants to find a piece of information relevant to a given query in a collection of TV programmes (Search subtask) ● And then navigate through a large archive using hyperlinks to the retrieved segments (Hyperlinking subtask)
  9. Search and Hyperlinking Task 2014 Data ● TV programme recordings provided by BBC ● All BBC programmes broadcast during 4 months ● 1335 hours for training, 2686 hours for testing ● Subtitles and three ASR transcripts (LIMSI, LIUM, and NST Sheffield) ● Metadata, detected shots, stable keyframes, prosodic features ● Search: 50 training and 30 test queries ● E.g. sightseeing london, egypt travel, celebrity diet ● Hyperlinking: 30 training and 30 test queries ● Given as a query segment (beginning and end)
  10. Evaluation ● Full document retrieval → MRR ● RR = 1 / rank of the first correctly retrieved document ● MRR = average of the RR values over the set of queries ● Retrieval of exact passages → MRR-window ● A retrieved segment counts as correct only if its starting point lies within 60 seconds of the starting point of the relevant segment ● MRRw = average of the RRw values over the set of queries ● Retrieval of exact passages → mGAP, MASP ● Take into account the exact beginning (and end) of a relevant segment
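The MRR and MRR-window measures described on this slide can be sketched in a few lines. This is a minimal illustration, not the task's evaluation scripts; function names and data shapes are chosen for the example.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """RR = 1 / rank of the first correctly retrieved document."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def window_reciprocal_rank(ranked_starts, relevant_start, window=60.0):
    """RRw: a retrieved segment counts as correct if its starting point
    lies within `window` seconds of the relevant segment's start."""
    for rank, start in enumerate(ranked_starts, start=1):
        if abs(start - relevant_start) < window:
            return 1.0 / rank
    return 0.0

def mean_over_queries(rr_values):
    """MRR / MRRw = average of the per-query RR values."""
    return sum(rr_values) / len(rr_values) if rr_values else 0.0
```

For example, a relevant document at rank 2 gives RR = 0.5, and a segment starting 30 s from the relevant start still counts as correct under the 60-second window.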
  11. Evaluation Cont. ● MAP, P5, P10, P20 ● MAP-bin ● MAP-tol Aly R., Eskevich M., Ordelman R., and Jones G.J.F.: Adapting Binary Information Retrieval Evaluation Metrics for Segment-based Retrieval Tasks. Technical Report, 2013.
  12. System Description
  13. Passage Retrieval ● Documents are automatically divided into shorter segments ● Segments serve as documents in the traditional IR setup ● The segmentation is crucial for the quality of the retrieval – especially the segment length → We focus on segmentation strategies
  14. Effect of Passage Retrieval
      Segm.    Manual: MRR   MRRw   mGAP    ASR: MRR   MRRw   mGAP
      None            0.879  0.315  0.029        0.858  0.333  0.027
      Manual          0.897  0.671  0.277        0.885  0.669  0.247
      ● Segmentation may highly improve retrieval of the segment beginnings (MRRw and mGAP measures) ● Segmentation may improve retrieval of full recordings (MRR measure) Similar Segments in Social Speech Task 2013
  15. Baseline System ● We employ the Terrier IR toolkit ● Hiemstra language model ● Parameter set to 0.35 (importance of a query term in a document) ● Stopword removal, stemming ● Post-filtering of the answers
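The Hiemstra language model used by the baseline scores a segment by mixing a document model with a collection model; the 0.35 parameter is the mixing weight for the document side. The talk uses Terrier's implementation; the standalone sketch below is only illustrative, with assumed argument names.

```python
import math
from collections import Counter

def hiemstra_lm_score(query_terms, doc_terms, collection_counts,
                      collection_len, lam=0.35):
    """Log-space Hiemstra LM score of a segment for a query.
    `lam` weights the document model against the collection model
    (set to 0.35 in the described system)."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_doc = tf[t] / dlen if dlen else 0.0
        p_col = collection_counts.get(t, 0) / collection_len
        p = lam * p_doc + (1.0 - lam) * p_col
        if p > 0.0:
            score += math.log(p)
        else:
            return float("-inf")  # query term unseen anywhere
    return score
```

With this mixture, a segment that actually contains a query term always outscores one that matches only through the collection model.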
  16. Post-filtering Effect ● MAP, P5, P10 and P20 are notably higher in the experiments in which we did not remove partially overlapping segments ● These measures do not distinguish whether a user has already seen the retrieved segment ● The overlapping segments are expected not to be as beneficial for the users
      Transcript  Filtering  MAP      P5      P10     P20     MAP-bin  MAP-tol
      Subtitles   Yes        0.3692   0.7467  0.7133  0.6050  0.2606   0.2157
      Subtitles   No         16.3486  0.8400  0.8367  0.8433  0.3172   0.0515
      Search and Hyperlinking Task 2014 (Search subtask)
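The post-filtering of partially overlapping segments can be sketched as a greedy pass over the ranked list: keep a segment only if it does not overlap an already-kept, higher-ranked segment from the same recording. This is a sketch of the idea; the exact overlap criterion of the submitted system may differ.

```python
def filter_overlapping(ranked_segments):
    """Drop any segment that partially overlaps a higher-ranked
    segment of the same recording. `ranked_segments` is a list of
    (video_id, start, end) tuples, best-ranked first."""
    kept = []
    for video_id, start, end in ranked_segments:
        overlaps = any(v == video_id and start < e and s < end
                       for v, s, e in kept)
        if not overlaps:
            kept.append((video_id, start, end))
    return kept
```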
  17. Baseline System – Hyperlinking ● Transformed into the Search subtask ● The query segment is transformed into a textual query by including all the words of the subtitles lying within the segment boundary ● Queries created from subtitles outperform ASR queries ● Even if we run the retrieval on the ASR transcripts
  18. System Tuning ● Metadata ● Concatenate metadata with each segment ● Title, episode title, description, short episode synopsis, service name and programme variant ● In Hyperlinking: concatenate metadata with the query segment ● Context ● In Hyperlinking: use 200 seconds before the segment beginning and after the segment end
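Turning a hyperlinking query segment into a textual query, optionally extended by the 200-second context described above, reduces to a timestamp filter over the subtitle words. A minimal sketch with assumed data shapes (word, timestamp pairs):

```python
def segment_to_query(subtitle_words, seg_start, seg_end, context=0.0):
    """Build a textual query from a hyperlinking query segment: take
    all subtitle words whose timestamps fall inside the segment
    boundary, optionally extended by `context` seconds on both sides
    (the tuned system uses 200 s of context)."""
    lo, hi = seg_start - context, seg_end + context
    return " ".join(w for w, t in subtitle_words if lo <= t <= hi)
```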
  19. System Tuning Cont.
      ● Search
      Transcript  Tuning    MAP     P5      P10     P20     MAP-bin  MAP-tol
      Subtitles   None      0.4209  0.7933  0.7433  0.5950  0.3192   0.3155
      Subtitles   Metadata  0.5127  0.7467  0.7267  0.6100  0.3538   0.3023
      ● Hyperlinking
      Transcript  Tuning            MAP     P5      P10     P20     MAP-bin  MAP-tol
      Subtitles   None              0.1147  0.3071  0.2786  0.2036  0.1021   0.0792
      Subtitles   Metadata+Context  0.4072  0.8067  0.7000  0.5417  0.2611   0.2237
      Search and Hyperlinking Task 2014
  20. Segmentation Strategies
  21. Segmentation Types ● Fixed-length (Window-based) ● Segments of equal length with a regular shift ● Claimed to be a very effective approach ● Similarity-based ● Measure the similarity between neighboring segments (e.g. cosine distance) ● Algorithms TextTiling and C99 ● Lexical-chain-based ● A sequence of lexically related word occurrences ● Feature-based
  22. Fixed-Length Segmentation Comparison ● S – Sentence ● Sh – Shot ● Sp – Speech Segment ● TP – Time + Pause ● TO – Time + Overlap (Fixed-Length Segment) M. Eskevich et al.: Multimedia information seeking through search and hyperlinking, ICMR 2013. Search and Hyperlinking Task 2012 (Search subtask)
  23. Fixed-length Segmentation – Segment Length Search and Hyperlinking Task 2013 (Search subtask)
  24. Fixed-length Segmentation – Segment Shift Search and Hyperlinking Task 2013 (Search subtask)
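Fixed-length (window-based) segmentation with a tunable segment length and shift, as studied in the two slides above, can be sketched as follows. The function name, parameter values, and the (word, start-time) input shape are illustrative, not from the system's code.

```python
def fixed_length_segments(words, length=50.0, shift=10.0):
    """Cut a transcript into equal-length segments with a regular
    shift (overlapping windows). `words` is a list of
    (word, start_time) pairs; `length` and `shift` are in seconds."""
    if not words:
        return []
    total = words[-1][1]  # timestamp of the last word
    segments = []
    start = 0.0
    while start <= total:
        end = start + length
        text = [w for w, t in words if start <= t < end]
        if text:
            segments.append((start, end, text))
        start += shift
    return segments
```

A shift smaller than the length produces overlapping segments, which is exactly what the post-filtering step later has to handle.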
  25. Feature-based Segmentation
  26. Feature-based Segmentation ● We identify possible segment boundaries (beginnings and ends) ● J48 decision trees (almost equivalent to C4.5), Weka framework ● Training data available from the Similar Segments in Social Speech Task, MediaEval 2013 ● Manually marked segments ● Conversations between university students ● Binary classification problem ● For each word in the transcripts, we predict whether a segment boundary occurs after this word ● Classes: segment boundary and segment continuation
  27. Used Features ● Cue words and tags ● N-grams which frequently appear at segment boundaries ● N-grams most informative for segment boundaries ● Manually defined n-grams ● Letter cases ● Length of the silence before the word ● Measured as the difference between timestamps of two adjacent words ● Division given in the transcripts (e.g., speech segments defined in the LIMSI transcripts) ● The output of the TextTiling algorithm
  28. Most Informative Features ● Division defined in the transcripts ● The length of silence ● Especially if it is longer than 300 ms, 400 ms, 500 ms, or 600 ms ● TextTiling algorithm output ● Segment beginnings: “if”, “I’m”, “especially”, “the”, “are you”, “you have”, “VBP PRP VBG”, … ● Segment ends: “good”, “interesting”, “lot”, …
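The per-word feature extraction feeding the boundary classifier can be sketched as below. The talk trains a J48 decision tree in Weka on such features; this sketch covers only the feature step, the cue-word sets are just the examples quoted on the slide, and the (token, start, end) input shape is an assumption.

```python
CUE_BEGINNINGS = {"if", "i'm", "especially"}   # examples from the slide
CUE_ENDS = {"good", "interesting", "lot"}      # examples from the slide

def boundary_features(words):
    """For each word, compute features used to decide whether a
    segment boundary occurs after it. `words` is a list of
    (token, start_time, end_time) tuples from a transcript."""
    rows = []
    for i, (token, start, end) in enumerate(words):
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        # silence = gap between this word's end and the next word's start
        silence = (next_start - end) if next_start is not None else 0.0
        rows.append({
            "token": token,
            "silence_after": silence,
            "silence_gt_300ms": silence > 0.3,
            "silence_gt_500ms": silence > 0.5,
            "cue_end": token.lower() in CUE_ENDS,
            "next_cue_beginning": (
                i + 1 < len(words)
                and words[i + 1][0].lower() in CUE_BEGINNINGS),
        })
    return rows
```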
  29. Feature-based Segmentation Approaches
  30. Feature-based Segmentation Approaches Comparison
      Beg.  End   MRR    MRRw   mGAP   #Seg    Len [s]
      --    --    0.656  0.052  0.027  2 k     2531.6
      Reg   Reg   0.671  0.388  0.245  234 k   49.5
      ML    --    0.549  0.117  0.060  3125 k  2.3
      --    ML    0.607  0.310  0.192  280 k   29.0
      ML    B+50  0.685  0.412  0.272  5820 k  49.6
      E+50  ML    0.715  0.428  0.298  2580 k  49.6
      ML    ML    0.626  0.392  0.229  5659 k  20.2
      Search and Hyperlinking Task 2013 (Search subtask), results on the subtitles
  31. Feature-based Segmentation vs. Fixed-Length Segmentation
      ● Search Task
      Transcript  Segm.     Seg. Len.  MAP     P5      P10     P20     MAP-bin  MAP-tol
      Subtitles   Fixed     60s        0.5127  0.7467  0.7267  0.6100  0.3538   0.3023
      Subtitles   Features  50s        0.8028  0.7867  0.7667  0.6933  0.3199   0.2350
      ● Hyperlinking Task
      Transcript  Segm.     Seg. Len.  MAP     P5      P10     P20     MAP-bin  MAP-tol
      Subtitles   Fixed     60s        0.4366  0.8667  0.7700  0.5633  0.2724   0.2580
      Subtitles   Features  50s        0.8253  0.8867  0.8567  0.7383  0.2525   0.1991
      Search and Hyperlinking Task 2014
  32. Visual Information in Segmentation ● The training data used for segmentation tuning are visually static → visual information would not be helpful there ● Create a segment boundary only if the visual similarity between adjacent segments is below a threshold (weight) ● Tune the weight on the Search and Hyperlinking training data
      Transcript  Segm.            MAP     P5      P10     P20     MAP-bin  MAP-tol
      Subtitles   Features         0.8028  0.7867  0.7667  0.6933  0.3199   0.2350
      Subtitles   Features+Visual  0.7701  0.7600  0.7500  0.6733  0.3285   0.2530
      Search and Hyperlinking Task 2014 (Search subtask)
  33. Visual and Prosodic Similarity
  34. Visual Similarity
  35. Visual Similarity Cont. ● We use Feature Signatures and the Signature Quadratic Form Distance http://siret.ms.mff.cuni.cz
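The Signature Quadratic Form Distance compares two feature signatures, each a set of (centroid, weight) pairs summarizing a keyframe. A minimal sketch, assuming a Gaussian similarity kernel with an illustrative parameter `alpha` (the actual kernel and parameters of the SIRET implementation may differ):

```python
import math

def sqfd(sig_a, sig_b, alpha=1.0):
    """Signature Quadratic Form Distance between two feature
    signatures. A signature is a list of (centroid, weight) pairs,
    each centroid a tuple of floats. Concatenates both signatures,
    negating the second signature's weights, and evaluates the
    quadratic form w^T A w with a Gaussian similarity matrix A."""
    cents = [c for c, _ in sig_a] + [c for c, _ in sig_b]
    weights = [w for _, w in sig_a] + [-w for _, w in sig_b]

    def sim(x, y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-alpha * d2)

    total = sum(weights[i] * weights[j] * sim(cents[i], cents[j])
                for i in range(len(cents))
                for j in range(len(cents)))
    return math.sqrt(max(total, 0.0))  # guard against rounding below 0
```

Identical signatures have distance 0, and the distance grows as the centroids drift apart, which is what lets the hyperlinking system reweight candidate segments by visual similarity.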
  36. Visual Similarity Results
      Transcript  Meta.  Weights  MAP     P5      P10     P20     MAP-bin  MAP-tol
      Subtitles   No     None     0.1618  0.4786  0.4107  0.2893  0.1423   0.1216
      Subtitles   No     Visual   0.1660  0.4929  0.4143  0.3000  0.1483   0.1245
      Subtitles   Yes    None     0.4301  0.8600  0.7767  0.5483  0.2689   0.2465
      Subtitles   Yes    Visual   0.4366  0.8667  0.7700  0.5633  0.2724   0.2580
      LIMSI       Yes    None     0.4166  0.8533  0.7133  0.5450  0.2659   0.2297
      LIMSI       Yes    Visual   0.4168  0.8667  0.7333  0.5400  0.2692   0.2414
      LIUM        Yes    None     0.4226  0.8333  0.7300  0.5433  0.2593   0.2547
      LIUM        Yes    Visual   0.4212  0.8400  0.7367  0.5350  0.2622   0.2632
      NST         Yes    None     0.4072  0.8067  0.7000  0.5417  0.2611   0.2237
      NST         Yes    Visual   0.4160  0.8267  0.7167  0.5483  0.2655   0.2440
      Search and Hyperlinking Task 2014 (Hyperlinking subtask)
  37. Visual Similarity Results – Positive Query Examples
  38. Visual Similarity Results – Negative Query Examples
  39. Prosodic Similarity
  40. Prosodic Similarity Results
      Transcript  Meta.  Weights   MAP     P5      P10     P20     MAP-bin  MAP-tol
      Subtitles   Yes    None      0.4301  0.8600  0.7767  0.5483  0.2689   0.2465
      Subtitles   Yes    Prosodic  0.4321  0.8533  0.7767  0.5517  0.2687   0.2473
      ● Small but promising improvement
      Search and Hyperlinking Task 2014 (Hyperlinking subtask)
  41. System Comparison
  42. Search Task
  43. Hyperlinking Task
  44. Conclusion
  45. Conclusion ● Passage Retrieval ● Improves retrieval of relevant segments ● Can improve retrieval of full recordings ● The segmentation approach is crucial for the retrieval ● Fixed-length segmentation works well ● Feature-based segmentation outperforms fixed-length segmentation ● Visual and prosodic similarity can improve results of text-based retrieval
  46. Thank you This research is supported by the Charles University Grant Agency (GA UK n. 920913)
