Processing Video Content and Transcript for Key-Topic Identification


Master Thesis Presentation: Processing Video Content and Transcript for Key-Topic Identification by Daniela Cretu

Published in: Data & Analytics

  1. 1. Processing Video Content and Transcript for Key-Topic Identification Daniela Cretu 2570710
  2. 2. Problem Statement ● The large number of devices that can take pictures and videos has led to an increase in uploaded multimedia content ● 300 hours of video are uploaded to YouTube every minute ● 3.25 billion hours of YouTube video are watched every month ● Cisco forecasts that by 2020 it would take a person 5 million years to watch every online video
  3. 3. Difference between video topic and description ● Recipe: relevant to the video ● Subscriber channels: not related to the video subject
  4. 4. Same video type - different categories Both cooking-related videos, yet appear in different categories
  5. 5. Tagging issues No tags! Many relevant tags Perhaps irrelevant tags?
  6. 6. Findability Problem ● Problem: content becomes less and less findable ● How can we fix this? Annotating the videos could improve findability
  7. 7. User annotation Two problems: ● Small number of tags (average 9 per video) ● May contain irrelevant tags, added to gain more views Alternative: automatically annotate the videos
  8. 8. Solution: Automatic annotation ● Process video streams (Google Video Intelligence, Clarifai API) ● Process video subtitles (Alchemy API, Google Natural Language) ● Tools processing the same type of data are likely to yield different results, so combine them ● Disadvantage: video tools can only provide content information, and text tools can only provide context information ● Best approach: combine tools which process different video dimensions
  9. 9. Solution Approach
  10. 10. Related work ● Concept detection in video relies mostly on low-level image attributes, e.g. color histograms (Lin et al., Chang et al.) ● Detecting concepts in subtitles: used to assign categories to videos (Katsiouli et al.) or as a basis for finding other relevant entities (Garcia et al.) ● Crowdsourcing concepts: users are encouraged to play games and draw outlines of objects in the video (Di Salvo et al., Kavasidis et al.) Color histograms for similar images
  11. 11. Research Question Main Research Question: How can we identify key topics in a video by processing the video stream and its textual description? Two sub-research questions: ● How can we determine whether certain concepts are more relevant than others? (RQ1) ● How can we best align the concepts from the input sources (video stream and transcript)? (RQ2)
  12. 12. Dataset ● YouTube videos ● Of various types and lengths ● Aimed to select videos which do not fit in more than one category ● In total 519 videos
  13. 13. Tools ● 1 subtitle processing tool - Google Natural Language [1] ○ Outputs detected concepts in order of appearance ● 2 video processing tools - Clarifai [2] and Google Video Intelligence (GVI) [3] ○ We chose these tools because the alternatives break the video down into keyframes and perform concept detection on images rather than on video
      Feature                                      Clarifai   GVI
      Output format                                JSON       JSON
      Tags per second                              yes        no
      Tags ordered alphabetically                  no         yes
      Occurrences of same tag grouped together     no         yes
      Confidence score for tag                     yes        yes
      'Video relevant' label for tags              no         yes
      [1] [2] [3]
  14. 14. GVI Sample Output Tag with single occurrence Multiple occurrences of same tag are grouped Tag relevant at video level
  15. 15. Clarifai Sample Output - list of vectors Vector of seconds Vector of concepts for each second Vector of probabilities for each concept
  16. 16. After running the tools on the dataset... ● Clarifai - highest number of tags ● Subtitles - lowest number of tags ● Large amount of unique tags between the tools - thus overlap is low
  17. 17. Research Methodology Aim : find key topics in the processed dataset
  18. 18. Research Methodology Tag Processing (for each tool separately) Step 1: Calculate the number of occurrences and the longest time interval of each tag In Figure 6: Black ● Number of occurrences = 1 ● Longest time interval = 5 (last - first) Classroom ● Number of occurrences = 3 ● Longest time interval = 5 (last - first)
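Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the thesis code: it assumes each tool's output has already been flattened into hypothetical (tag, start second, end second) tuples, that one tuple is one occurrence, and that the interval length is the end second minus the start second. The timestamps are made up so that the totals match the "Black" and "Classroom" values on the slide.

```python
from collections import defaultdict


def tag_stats(detections):
    """Count occurrences and find the longest interval for each tag.

    detections: iterable of (tag, start_sec, end_sec) tuples (assumed format).
    """
    stats = defaultdict(lambda: {"occurrences": 0, "longest_interval": 0})
    for tag, start, end in detections:
        entry = stats[tag]
        entry["occurrences"] += 1
        # Interval length = last second minus first second of this occurrence.
        entry["longest_interval"] = max(entry["longest_interval"], end - start)
    return dict(stats)


# Hypothetical timestamps chosen to reproduce the slide's two example tags.
detections = [
    ("black", 0, 5),
    ("classroom", 0, 5),
    ("classroom", 7, 9),
    ("classroom", 12, 13),
]
stats = tag_stats(detections)
print(stats["black"])
print(stats["classroom"])
```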
  19. 19. Research Methodology Step 2: Rescale the confidence scores so that the tag with the highest confidence score ends up with confidence = 1 (the highest confidence score becomes the divisor) a. Recalculate the confidence of all other tags by dividing by the same value In the example to the right: "text" has the highest confidence score, so it is used as the divisor
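The rescaling in Step 2 amounts to dividing every score by the maximum. A minimal sketch, assuming the scores of one tool for one video are held in a dictionary; the tag names and values below are illustrative and are not taken from the slide's figure.

```python
def normalize_confidence(scores):
    """Divide every confidence score by the highest one, so the top tag gets 1.0."""
    if not scores:
        return {}
    top = max(scores.values())
    if top == 0:
        # Nothing to rescale when every score is zero.
        return dict(scores)
    return {tag: score / top for tag, score in scores.items()}


# Illustrative values: "text" has the highest score and becomes the divisor.
scores = {"text": 0.8, "font": 0.4, "paper": 0.2}
normalized = normalize_confidence(scores)
print(normalized)
```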
  20. 20. Research Methodology Step 3: Calculate a relevance score for each tag ● Relevance = sum of the tag's confidence scores / video length in seconds Step 4: Combine the tags from the three tool outputs ● Use the average of the three relevance scores ● If a tag was detected by a tool, use its relevance from that tool; if not, use 0 Combining tags from the three tools
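Steps 3 and 4 can be sketched as follows. The function names and the sample values are assumptions made for illustration; the confidence scores are presumably the rescaled ones from Step 2, which the slide does not state explicitly.

```python
def relevance(confidences, video_length_sec):
    """Step 3: sum of a tag's confidence scores divided by the video length."""
    return sum(confidences) / video_length_sec


def combine(tool_relevances):
    """Step 4: average each tag's relevance over all tools, using 0 when a tool missed it."""
    n_tools = len(tool_relevances)
    all_tags = set().union(*tool_relevances)
    return {
        tag: sum(r.get(tag, 0.0) for r in tool_relevances) / n_tools
        for tag in all_tags
    }


# A tag seen three times in a 10-second video (illustrative values).
print(relevance([1.0, 0.5, 0.5], 10))

# Hypothetical per-tool relevance scores for one video.
clarifai = {"food": 0.6, "kitchen": 0.3}
gvi = {"food": 0.9}
subtitles = {"recipe": 0.3}
combined = combine([clarifai, gvi, subtitles])
print(combined)
```

With a plain average, a tag found by a single tool is divided by three just like any other, so it can never score above one third of its own relevance.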
  21. 21. Evaluation Goals We have identified 4 evaluation goals ● Confirm our computations (EV1) ● Check for bias towards one of the tools (EV2) ● Check for any correlation between bias and video characteristics (EV3) ● Check if the automatic tools may have missed something (EV4) Evaluate using crowdsourcing
  22. 22. Strategy for Selecting Videos to Evaluate ● Choose a sample of videos with high overlap between the 3 tools ● Shorter videos were found to be more suitable for crowdsourcing (workers tend to lose focus on longer videos), so we decided to show 10-second segments of video ● From each sampled video, pick the 10-second segments in which highly relevant tags occur (relevance as obtained after combining the outputs of the tools) ● In total, 2169 segments to be evaluated, from 213 videos
  23. 23. Selecting Tags to Display For each video segment, compose lists of most relevant, maybe relevant and not so relevant tags from the tags for the overall video, with at most 10 tags in each category 3 variables help to assign tags to categories: 1. Maximum relevance score for the segment (MaxConf) 2. The tag's relevance score (Rel) 3. A relevance threshold (Thresh = 0.2 if MaxConf > 0.2, and Thresh = 0.02 if MaxConf <= 0.2) Assigning tags to categories: 1. If MaxConf - Thresh < Rel < MaxConf AND the category holds fewer than 10 tags => put the tag in that category 2. Repeat until the rule no longer holds or the category holds 10 tags 3. MaxConf = MaxConf - Thresh 4. Repeat until the categories are full or no tags are left
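One possible reading of the assignment rule is sketched below. Several points are assumptions, because the slide leaves them open: each category is treated as one band of width Thresh, the upper bound is inclusive (otherwise the tag that defines MaxConf would never be placed), tags are taken in descending relevance order, and tags falling below the third band stay unassigned.

```python
CATEGORIES = ("most relevant", "maybe relevant", "not so relevant")
MAX_TAGS = 10  # at most 10 tags per category, as on the slide


def assign_categories(relevances):
    """Split tags into three relevance bands of width Thresh, starting at MaxConf."""
    result = {category: [] for category in CATEGORIES}
    if not relevances:
        return result
    upper = max(relevances.values())          # MaxConf
    thresh = 0.2 if upper > 0.2 else 0.02     # Thresh
    ranked = sorted(relevances.items(), key=lambda item: item[1], reverse=True)
    for category in CATEGORIES:
        lower = upper - thresh
        # Assumption: inclusive upper bound so the top tag is placed.
        band = [tag for tag, rel in ranked if lower < rel <= upper]
        result[category] = band[:MAX_TAGS]
        upper = lower                         # MaxConf = MaxConf - Thresh
    return result


# Hypothetical relevance scores for the tags of one segment.
segment_tags = {"food": 0.9, "cooking": 0.8, "kitchen": 0.6, "bowl": 0.45, "table": 0.1}
categories = assign_categories(segment_tags)
print(categories)
```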
  24. 24. Crowdsourcing Task ● Ask users to watch 10 seconds of video ● Users can then select tags related to the video from the list ● Users can add any other tags they think are relevant to the segment ● Each task is evaluated by 15 workers ● Each worker gets 2 cents for each completed task ● Workers cannot submit the results without watching all 10 seconds
  25. 25. Evaluation Strategy Watch video (Step 1) Select tags (step 2) Add additional tags if desired (step 3)
  26. 26. Evaluation Results - EV1 ● At segment level, an average of 41.74% of the tags rated highly relevant by the crowdsourcing workers were correctly detected by the algorithm ● Maybe relevant tags have the smallest overlap of all ● Additional subtitle tags (detected only by the subtitle tool) have the highest overlap, BUT we counted every tag chosen by at least one worker in the same category, so their relevance may be low
  27. 27. Evaluation Results - EV1 ● At video level, an average of 46.19% of the tags rated highly relevant by the workers were also detected as highly relevant by the algorithm ● As at segment level, medium relevance tags have the lowest overlap ● Low relevance tags have slightly higher overlap than high relevance tags ● For very short videos, overlap is higher for highly relevant tags
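The overlap percentages reported for EV1 can be computed along the following lines. This is a sketch under assumptions: overlap is taken as the share of worker-chosen highly relevant tags that the algorithm also placed in the highly relevant category, averaged over videos (or segments). The tag lists are invented for illustration and do not come from the dataset.

```python
def overlap_percent(worker_tags, algorithm_tags):
    """Share of worker-chosen tags that the algorithm also detected, in percent."""
    worker = set(worker_tags)
    if not worker:
        return 0.0
    return 100.0 * len(worker & set(algorithm_tags)) / len(worker)


def mean_overlap(pairs):
    """Average the per-video overlap over (worker_tags, algorithm_tags) pairs."""
    values = [overlap_percent(worker, algorithm) for worker, algorithm in pairs]
    return sum(values) / len(values)


# Hypothetical highly relevant tags for two videos: (workers, algorithm).
pairs = [
    (["food", "kitchen", "chef", "bowl"], ["food", "kitchen", "table"]),
    (["beach", "sea"], ["beach", "sand"]),
]
average = mean_overlap(pairs)
print(average)
```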
  28. 28. Evaluation Results - EV2 ● Clarifai - mainly low relevance tags ● Most high relevance tags were detected by both visual processing tools ● Bias towards choosing tags detected by more than one tool
  29. 29. Evaluation Results - EV3 By assigning numerical values to the time distribution and tag category, we were able to calculate the correlation with the corresponding Excel function. Assignment: {100, 200, 300, 400} corresponds to {under 3 min, 3-5 min, 5-10 min, 10-15 min} {10, 20, 30, 40, 50} corresponds to {clarifai + gvi, clarifai + gvi + sub, clarifai, gvi, sub} High correlation score for cooking between time distribution and processing tool. No correlation, or only a weak one, for the other 4 categories (for nature there is a correlation, but a very weak one)
  30. 30. Evaluation Results - EV3 Using the same assignment as in the previous slide for the tag detection tools, we assigned {1, 2, 3, 4, 5} as aliases of {cooking, culture, nature, travel, other} For all time distributions, the correlation factor is negative No apparent correlation between categories and detection tools in any time distribution.
  31. 31. Evaluation Results - EV4 ● Only about 20% of the additional tags were found in our lists ● Most of them with low relevance
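The numeric coding and the correlation can be reproduced outside Excel. The slide does not name the Excel function; the sketch below assumes a Pearson correlation coefficient, which is what Excel's CORREL computes. The two code tables come from the slide, while the observations are hypothetical.

```python
from math import sqrt

# Numeric codes as assigned on the slide.
DURATION_CODE = {"under 3 min": 100, "3-5 min": 200, "5-10 min": 300, "10-15 min": 400}
TOOL_CODE = {
    "clarifai + gvi": 10,
    "clarifai + gvi + sub": 20,
    "clarifai": 30,
    "gvi": 40,
    "sub": 50,
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical observations: (video duration bucket, tool combination of a chosen tag).
observations = [
    ("under 3 min", "clarifai + gvi"),
    ("3-5 min", "clarifai + gvi + sub"),
    ("5-10 min", "clarifai"),
    ("10-15 min", "sub"),
]
xs = [DURATION_CODE[duration] for duration, _ in observations]
ys = [TOOL_CODE[tool] for _, tool in observations]
r = pearson(xs, ys)
print(round(r, 3))
```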
  32. 32. Evaluation Result - RQ1 ● Identified a bias towards choosing tags detected by more than one tool ● These should be higher up in the list ● Better alignment strategy : instead of simple average, use a weighted average ● Assign higher weight to tags detected by more than one tool
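The slide proposes a weighted average but gives no formula, so the following is only one hypothetical way to realise it: the simple average from Step 4 is multiplied by a weight that grows with the number of tools that detected the tag. The weight table and the sample scores are invented for illustration.

```python
# Hypothetical weights keyed by the number of tools that detected a tag.
AGREEMENT_WEIGHT = {1: 1.0, 2: 1.5, 3: 2.0}


def weighted_combine(tool_relevances, weights=AGREEMENT_WEIGHT):
    """Boost the simple average of tags that several tools agree on."""
    n_tools = len(tool_relevances)
    combined = {}
    for tag in set().union(*tool_relevances):
        scores = [r[tag] for r in tool_relevances if tag in r]
        simple_average = sum(scores) / n_tools  # tools that missed the tag count as 0
        combined[tag] = weights[len(scores)] * simple_average
    return combined


# "food" is found by two tools, "kitchen" by one tool with a higher score.
clarifai = {"food": 0.4, "kitchen": 0.9}
gvi = {"food": 0.4}
subtitles = {}
weighted = weighted_combine([clarifai, gvi, subtitles])
ranked = sorted(weighted, key=weighted.get, reverse=True)
print(ranked)
```

In this example the simple average would place "kitchen" (0.30) above "food" (about 0.27), whereas the weighting moves "food" to the top, which matches the workers' preference for tags detected by more than one tool.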
  33. 33. Evaluation Result - RQ2 ● Current alignment detects 46.19% of highly relevant tags for the sampled videos (comparison between the highly relevant tags detected by our algorithm and the highly relevant tags chosen by crowdsourcing workers) ● There is a percentage of tags detected as being of medium relevance which have been promoted to high relevance after crowdsourcing ● Find a better relevance threshold
  34. 34. Evaluation Result - RQ2 Examined users' choice behaviour for each category (the other three categories are on the next slide) to see whether combining tools gives more accurate results ● For each category, tags detected by GVI + Clarifai are chosen more often than tags detected by either Clarifai or GVI alone ● Adding subtitles does not make much of a difference (the highest overlap score for highly relevant tags occurs for tags detected by GVI + Clarifai) ● Subtitles have the fewest chosen tags (remember that the subtitle tags included here are not detected by any other tool)
  35. 35. ● Combining visual tools: better than using them individually ● Combining visual tags with subtitles: better than using just subtitles ● Linear increase in tags: as relevance decreases, the number of tags increases
  36. 36. Conclusion and Future Work ● Our alignment strategy correctly detects around 46% of relevant tags for the sampled videos ● We wanted to find out whether combining tools would yield better results ○ Tags from GVI are chosen more often than Clarifai tags ○ Most tags for the sampled videos come from GVI + Clarifai, and these are more relevant ○ Adding subtitles to visual tags is better than using just subtitle tags ● There are few differences between video categories, so they can be used as one single dataset ● Related work deals mostly with one source of information, whereas we deal with information from 3 different sources ○ It is also mostly concerned with aligning tags to parts of the video, whereas we tried to find tags relevant to the whole video ● Our algorithm can be improved ● Include crowdsourcing to identify a better threshold, not just for confirmation ● Use a weighted average as part of the alignment
  37. 37. Questions?
  38. 38. References ● Lin, C. Y. et al.: 'VideoAL: A novel End-To-End MPEG-7 Video Automatic Labeling System' (2003) ● Chang, S. F., Ellis, D., Jiang, W., Lee, K., Yanagawa, A., Loui, A. C. and Luo, J.: 'Large-Scale Multimodal Semantic Concept Detection for Consumer Video' (2007) ● Katsiouli, P., Tsetsos, V. and Hadjiefthymiades, S.: 'Semantic Video Classification Based on Subtitles and Domain Terminologies' (2007) ● Garcia, J. L. R., Vocht, L., Troncy, R., Mannens, E. and Van de Walle, R.: 'Describing and contextualizing events in TV news shows' (2014) ● Di Salvo, R., Giordano, D. and Kavasidis, I.: 'A Crowdsourcing Approach to Support Video Annotation' (2014) ● Kavasidis, I., Palazzo, S., Di Salvo, R., Giordano, D. and Spampinato, C.: 'An innovative web-based collaborative platform for video annotation' (2013)