
Translating Related Words to Videos and Back through Latent Topics


  1. Translating Related Words to Videos and Back through Latent Topics. Pradipto Das, Rohini K. Srihari and Jason J. Corso, SUNY Buffalo. WSDM 2013, Rome, Italy.
  2. WiSDoM is beyond words. "Master Yoda, how do I find wisdom from so many things happening around us?" "Go to the center of the data and find your wisdom you will."
  3. WiSDoM is beyond words. (Repeat of slide 2.)
  4. What do the centers look like? Topic word clouds, e.g.: (parkour, perform, traceur, area, flip, footage, jump, park, urban, run); (lobster, burger, dress, celery, Christmas, wrap, roll, mix, tarragon); (outdoor, outdoors, kid, group, pedestrian, playground); (steam, season, scratch, stick, live, water, lemon, garlic); (floor, parkour, wall, jump, handrail, locker, contestant, school, run); (make, dog, sandwich, man, outdoors, guy, bench, black, sit, park); (interview, block, slide, indoor, perform, build, tab, duck, white, disgustingly, toe, cough, feed, rub, contest, parody). Captions: "Interviews indoors can be tough!" "Be careful what people do with their sandwiches!"
  5. The actual ground-truth synopses overlaid on the topic word clouds of the previous slide: "Man performs parkour in various locations"; "Kid does parkour around the park"; "A family holds a strange burger assembly and wrapping contest at Christmas"; "Footage of a group of guys performing parkour outdoors"; "Montage of free running up a tree and through the woods"; "Tutorial: man explains how to make lobster rolls from scratch"; "Interview with parkour contestants"; "One guy is making a sandwich outdoors". Captions: "Interviews indoors can be tough!" "Be careful what people do with their sandwiches!"
  6. Back to conventional wisdom: translation. There is some model that captures the correspondence of blood-flow patterns in the brain to the world being observed; given a slightly different pattern, we are able to translate it to concepts present in our vocabulary, i.e. to a lingual description. Three basic assumptions of machine learning are satisfied: 1) there is a pattern; 2) we do not know the target function; 3) there is data to learn from. Diagram: training/testing pipeline combining a topic model (LDA) with regression. [S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu and J. Gallant, "Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies," Current Biology 21(19), 2011] [F. Pereira, G. Detre and M. Botvinick, "Generating text from functional brain images," Frontiers in Human Neuroscience 5(72), 2011]
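A minimal sketch of that LDA-plus-regression pipeline in scikit-learn, on synthetic stand-in data; the dimensions, the choice of Ridge regression, and all variable names are illustrative assumptions, not details taken from either paper:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy stand-ins: word counts for 100 training documents over a 500-word
# vocabulary, paired with a 200-dim "brain image" feature vector each.
doc_word_counts = rng.poisson(0.1, size=(100, 500))
brain_features = rng.normal(size=(100, 200))

# 1) Learn latent topics from the text side alone.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(doc_word_counts)   # per-doc topic proportions

# 2) Regress topic proportions onto the paired brain features.
reg = Ridge(alpha=1.0).fit(brain_features, doc_topics)

# 3) Test time: map a new brain image to topic space, then rank words
#    by their probability under the predicted topic mixture.
new_image = rng.normal(size=(1, 200))
pred_topics = np.clip(reg.predict(new_image), 1e-12, None)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
word_scores = pred_topics @ topic_word            # marginalize out topics
print("top predicted word ids:", np.argsort(word_scores[0])[::-1][:10])
```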
  7. Back to conventional wisdom: translation (continued). Giving back to the community: driverless cars are already helping the visually impaired to drive around; it will be good to enable visually impaired drivers to hear the scenery in front of them. (Same translation pipeline and references as the previous slide.)
  8. Do we speak all that we see? Multiple human summaries (max 10 words, i.e. imposing a length constraint): 1. There is a guy climbing on a rock-climbing wall. 2. A man is bouldering at an indoor rock climbing gym. 3. Someone doing indoor rock climbing. 4. A person is practicing indoor rock climbing. 5. A man is doing artificial rock climbing.
  9. Centers of attention (topics). Not so important: the hand holding the climbing surface, how many rocks, the sketch on the board, the wrist-watch, what's there in the back, the dress of the climber, empty slots, the color of the floor. Multiple human summaries (max 10 words): the same five as the previous slide. Summaries point toward information needs!
  10. From patterns to topics to sentences. Example: "A young man climbs an artificial rock wall indoors" has a direct subject (man), a direct object (wall), an adjective modifier answering "what kind of wall?" (artificial rock), and an adverb modifier answering "climbing where?" (indoors). Spoken language is complex: structured according to various grammars and dependent on active topics. Different paraphrases describe the same visual input. Major topic: rock climbing. Sub-topics: artificial rock wall, indoor rock climbing gym.
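To make the grammatical roles on this slide concrete, here is a small dependency-parse sketch (mine, not from the paper) using spaCy; it assumes the en_core_web_sm model is installed:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("A young man climbs an artificial rock wall indoors")

for tok in doc:
    # token, dependency label, and the head it attaches to
    print(f"{tok.text:10s} {tok.dep_:10s} -> {tok.head.text}")

# Expected roles: "man" as nominal subject of "climbs", "wall" as direct
# object, "artificial"/"rock" as modifiers of "wall", and "indoors" as
# an adverbial modifier of "climbs".
```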
  11. Object detection models. Annotations for training object/concept models require expensive frame-wise manual effort drawing bounding boxes; difficulties include camera shake, camera motion and zooming; and careful consideration is needed of which objects/concepts to annotate. A focus on object/concept detection (e.g. trained models for "man with microphone" or "climbing person") is noisy for videos in the wild, and does not answer which objects/concepts are important for summary generation.
  12. Translating across modalities. Learning latent translation spaces, a.k.a. topics: mixed membership of latent topics; some topics capture observations that commonly co-occur; other topics allow for discrimination; different topics can be responsible for different modalities. No annotations needed: only clip-level summaries, e.g. the human synopsis "A young man is climbing an artificial rock wall indoors".
  13. Translating across modalities. Using learnt translation spaces for prediction: topics are marginalized out to permute the vocabulary for predictions; the lower the correlation among topics, the better the permutation; sensitive to priors for real-valued data. Text translation: \(p(w_v \mid \mathbf{w}_O, \mathbf{w}_H) \propto \sum_{o=1}^{O}\sum_{i=1}^{K} \phi^{(O)}_{d,o,i}\, p(w_v \mid \beta_i) + \sum_{h=1}^{H}\sum_{i=1}^{K} \phi^{(H)}_{d,h,i}\, p(w_v \mid \beta_i)\).
  14. Translating across modalities (annotated). In the translation formula \(p(w_v \mid \mathbf{w}_O, \mathbf{w}_H) \propto \sum_{o=1}^{O}\sum_{i=1}^{K} \phi^{(O)}_{d,o,i}\, p(w_v \mid \beta_i) + \sum_{h=1}^{H}\sum_{i=1}^{K} \phi^{(H)}_{d,h,i}\, p(w_v \mid \beta_i)\), \(\phi^{(O)}_{d,o,i}\) is the responsibility of topic i over the real-valued observations, \(\phi^{(H)}_{d,h,i}\) is its responsibility over the discrete video features, and \(p(w_v \mid \beta_i)\) is the probability of learnt topic i explaining word \(w_v\) in the text vocabulary.
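A minimal numpy sketch of this marginalization, assuming the per-document responsibilities and topic-word distributions have already been inferred; all array names, shapes, and the random stand-in values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
K, V = 10, 500        # topics, text-vocabulary size
O, H = 6, 40          # real-valued observations, discrete video "words" in clip d

def rows_to_simplex(a):
    """Normalize rows to sum to 1 (toy responsibilities/distributions)."""
    return a / a.sum(axis=1, keepdims=True)

beta = rows_to_simplex(rng.gamma(0.5, size=(K, V)))    # p(w_v | beta_i)
phi_O = rows_to_simplex(rng.gamma(0.5, size=(O, K)))   # responsibilities, real-valued side
phi_H = rows_to_simplex(rng.gamma(0.5, size=(H, K)))   # responsibilities, discrete side

# Marginalize topics: sum responsibilities over observations, then push
# the resulting topic weights through the topic-word distributions.
score = (phi_O.sum(axis=0) + phi_H.sum(axis=0)) @ beta   # shape (V,)

top10 = np.argsort(score)[::-1][:10]
print("predicted keyword ids for clip d:", top10)
```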
  15. Wisdom of the young padawans. OB (Object Bank): high-level semantic representation of images from low-level features [L-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010]. HOG3D (histogram of oriented gradients in 3D): effective action-recognition features for videos [A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008]. Color histogram: 512 RGB color bins; histograms are computed on densely sampled frames; large deviations in the extremities of the color spectrum are discarded.
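A minimal sketch of the 512-bin RGB histogram (8 levels per channel, so 8^3 = 512 bins) for a single frame; the handling of "large deviations in the extremities" is omitted here, and the function name and toy frame are my own:

```python
import numpy as np

def rgb_histogram_512(frame):
    """512-bin color histogram of an HxWx3 uint8 RGB frame."""
    # Quantize each channel to 3 bits: 8 levels per channel.
    q = frame.astype(np.uint16) >> 5                 # values in 0..7
    bin_ids = (q[..., 0] << 6) | (q[..., 1] << 3) | q[..., 2]
    hist = np.bincount(bin_ids.ravel(), minlength=512).astype(float)
    return hist / hist.sum()

# Toy frame; in practice these would be densely sampled video frames.
frame = np.random.default_rng(2).integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
h = rgb_histogram_512(frame)
print(h.shape, h.sum())   # (512,) 1.0
```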
  16. Wisdom of the young padawans. Scenes from images belonging to different topics and sub-topics. Topic "town hall meeting": "The video is about a man answering a question from the podium using a microphone"; "Two cameramen film a cop taking a camera from a woman sitting in a group". Topic "rock climbing", with sub-topics: "A young man climbs an artificial rock wall indoors"; "A man climbs a boulder outdoors with a friend spotting".
  17. Wisdom of the young padawans. Global GIST energy [A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145-175, 2001]: eight perceptual dimensions capture most of the 3D structure of real-world scenes: naturalness, openness, perspective or expansion, size or roughness, ruggedness, mean depth, symmetry and complexity. GIST in general terms: an energy space that pervades the arrangements of objects; it does not really care about the specificity of the objects, and it helps us summarize an image even after it has disappeared from our sight.
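A rough GIST-like descriptor sketch, assuming scikit-image is available; this is a crude approximation of the Oliva-Torralba descriptor (Gabor energy pooled on a spatial grid), not the authors' exact feature extractor, and the filter-bank sizes are arbitrary:

```python
import numpy as np
from skimage.filters import gabor

def gist_like(gray, freqs=(0.1, 0.25), n_orient=4, grid=4):
    """Crude GIST-style descriptor: Gabor energy averaged over a
    grid x grid spatial layout, for a small bank of frequencies/orientations."""
    h, w = gray.shape
    feats = []
    for f in freqs:
        for k in range(n_orient):
            real, imag = gabor(gray, frequency=f, theta=k * np.pi / n_orient)
            energy = np.hypot(real, imag)
            # Average the filter energy inside each spatial cell.
            for i in range(grid):
                for j in range(grid):
                    cell = energy[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
                    feats.append(cell.mean())
    return np.asarray(feats)   # 2 freqs * 4 orients * 16 cells = 128 dims

gray = np.random.default_rng(3).random((64, 64))
print(gist_like(gray).shape)   # (128,)
```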
  18. Yoda's wisdom. It will be nice to have the Force as a "feature"! "For my ally is the Force. Its energy surrounds us and binds us. Luminous beings are we, not this crude matter." (Overlaid on the example scenes of slide 16.)
  19. Datasets. NIST's 2011 TRECVID Multimedia Event Detection (MED) Events and Dev-T datasets. The training set is organized into 15 event categories: 1) attempting a board trick; 2) feeding an animal; 3) landing a fish; 4) wedding ceremony; 5) woodworking project; 6) birthday party; 7) changing a vehicle tire; 8) flash mob gathering; 9) getting a vehicle unstuck; 10) grooming an animal; 11) making a sandwich; 12) parade; 13) parkour; 14) repairing an appliance; 15) working on a sewing project. Each video has its own high-level summary, varying from 2 to 40 words (about 10 on average). There are 2062 clips in the training set and 530 clips for the first 5 events in the Dev-T set. Dev-T summaries are only used as reference summaries for evaluation with up to 10 predicted keywords.
  20. The summarization perspective. Multiple sets of documents (sets of frames in videos), with sub-events (e.g. skateboarding, snowboarding, surfing) and multiple sentences (groups of segments in frames), feed a multimedia topic model that permutes event-specific vocabularies (skateboarding, wedding ceremony, feeding animals, woodworking project, landing fishes) to produce bag-of-keywords multi-document summaries and, eventually, natural-language multi-document summaries.
  21. The summarization perspective: why event-specific vocabularies? Example (feeding an animal):

      Model                       Actual synopsis        Predicted words (top 10)
      One school of thought       man feeds fish bread   fish jump bread fishing skateboard pole machine car dog cat
      Another school of thought   man feeds fish bread   shampoo sit condiment place bread fill plate jump pole fishing

      Intuitively, multiple objects and actions are shared across events and many different words get semantically associated; prediction quality degenerates rapidly!
  22. Previously [P. Das, R. K. Srihari and Y. Fu. "Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives," CIKM, Glasgow, Scotland, 2011]: words forming other Wiki articles, article-specific content words, and words corresponding to the embedded multimedia.
  23. Afterwards [P. Das, R. K. Srihari and J. J. Corso. "Translating Related Words to Videos and Back through Latent Topics," WSDM, Rome, Italy, 2013]: words forming other Wiki articles, article-specific content words, and words corresponding to the embedded multimedia.
  24. The family of multimedia topic models. Corr-MMGLDA: if a single topic generates a scene, the same topic generates all text in the document; a considerable strong point, but a drawback for summary generation if this is not the case. MMGLDA: more diffuse translation of both visual and textual patterns through the latent translation spaces; intuitively it aids frequency-based summarization. The key is to use an asymmetric Dirichlet prior. Plate-diagram ingredients: document-specific topic proportions, indicator variables, synopsis words, GIST features, visual "words", and topic parameters for explaining latent structure within observation ensembles.
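To make the plate diagram concrete, here is a toy forward-sampling sketch of an MMGLDA-style generative process; this is my reading of the slide, not the authors' exact model: each document draws topic proportions from an asymmetric Dirichlet, and each synopsis word, discrete visual word, and real-valued GIST vector picks its own topic indicator:

```python
import numpy as np

rng = np.random.default_rng(4)
K, V_text, V_vis, P = 5, 200, 100, 16   # topics, text vocab, visual vocab, GIST dims

alpha = np.linspace(2.0, 0.2, K)                           # asymmetric Dirichlet prior
beta_text = rng.dirichlet(np.full(V_text, 0.1), size=K)    # topic-word dists, text
beta_vis = rng.dirichlet(np.full(V_vis, 0.1), size=K)      # topic-word dists, visual
mu, sigma = rng.normal(size=(K, P)), 0.3                   # topic Gaussians for GIST

def sample_document(n_words=10, n_vis=50, n_gist=8):
    theta = rng.dirichlet(alpha)                 # document topic proportions
    # Each observation draws its own topic indicator z, then its value.
    words = [rng.choice(V_text, p=beta_text[rng.choice(K, p=theta)]) for _ in range(n_words)]
    vis = [rng.choice(V_vis, p=beta_vis[rng.choice(K, p=theta)]) for _ in range(n_vis)]
    gist = [rng.normal(mu[rng.choice(K, p=theta)], sigma) for _ in range(n_gist)]
    return theta, words, vis, np.array(gist)

theta, words, vis, gist = sample_document()
print(theta.round(2), words[:5], vis[:5], gist.shape)
```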
  25. Topic modeling performance. Test ELBOs on events 1-5 in the Dev-T set measure held-out log likelihoods on both the videos and the associated human summaries; prediction ELBOs on events 1-5 measure held-out log likelihoods on just the videos, in the absence of the text. In a purely multinomial MMLDA model, failures of independent events contribute highly negative terms to the log likelihoods; this is clearly NOT a measure of keyword summary generation power. For the MMGLDA family of models, Gaussian components can partially remove the independence through covariance modeling, allowing only the responsible topic-Gaussians to contribute to the likelihood.
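For reference, the ELBO (evidence lower bound) these slides refer to is the standard variational bound on the held-out log likelihood; in generic LDA-style form (notation mine, not the paper's):

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  \;\ge\;
  \mathbb{E}_{q(\theta, \mathbf{z})}\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
  - \mathbb{E}_{q(\theta, \mathbf{z})}\!\left[\log q(\theta, \mathbf{z})\right]
  \;=\; \mathcal{L}(\gamma, \phi;\, \alpha, \beta)
```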
  26. Translating related words to videos: per-topic statistics for Corr-MMGLDA vs. MMGLDA over 10 topics.

      Topic                     1         2         3         4         5         6        7        8         9        10
      Corr-MMGLDA α             0.445936  0.451391  0.462443  0.397392  0.374922  0.573839 0.425912 0.375423  0.38186  0.189047
      MMGLDA α                  0.414354  0.422954  0.427442  0.359592  0.353317  0.552872 0.39681  0.349695  0.345466 0.163971
      Corr-MMGLDA log(α/|Λ|)    12.6479   61.7312   50.0512   58.7659   60.1194   104.628  28.2949  31.3856   18.9223  8.164
      MMGLDA log(α/|Λ|)         12.498    61.4666   49.8858   58.643    59.9248   104.623  28.2264  31.2219   18.6953  8.1025
  27. Translating related words to videos (continued). Corr-MMGLDA is able to capture more variance relative to MMGLDA; α for Corr-MMGLDA is also slightly higher than that for MMGLDA; this can allow related but topically unique concepts to appear up front. (Same table of per-topic α and log(α/|Λ|) values as the previous slide.)
  28. Related words to videos: difficult examples. Predicted keyword clouds for two sewing-related clips: (measure, project, lady, tape, indoor, sew, marker, pleat, highwaist, zigzag, scissor, card, mark, teach, cut, fold, stitch, pin, woman, skirt, machine, fabric, inside, scissors, make, leather, kilt, man, beltloop); (sew, woman, fabric, make, machine, show, baby, traditional, loom, blouse, outdoors, blanket, quick, rectangle, hood, knit, indoor, stitch, scissors, pin, cut, iron, studio, montage, measure, kid, penguin, dad, stuff, thread).
  29. Related words to videos: more difficult examples. Predicted keyword clouds for repair-related clips: (clock, mechanism, repair, computer, tube, wash, machine, lapse, click, desk, mouse, time, front, wd40, pliers, reattach, knob, make, level, video, water, control, person, clip, part, wire, inside, indoor, whirlpool); (man, gear, machine, guy, repair, sew, fan, test, make, replace, grease, vintage, motor, box, indoor, man, tutorial, fuse, bypass, brush, wrench, repairman, lubricate, workshop, bottom, remove, screw, unscrew, screwdriver, video, wire).
  30. A few words is worth a thousand frames! (Figure: examples from MMGLDA.)
  31. A few words is worth a thousand frames! (Figure: further examples from MMGLDA.)
  32. Event classification and summarization. A c-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification; 55% test accuracy is easily achievable. Summaries are evaluated using ROUGE-1. Reference point, HEXTAC 2009 (100-word human references vs. 100-word manually extracted summaries): average recall 0.37916 (95% confidence interval 0.37187-0.38661), average precision 0.39142 (95% confidence interval 0.38342-0.39923). (Pipeline diagram as on slide 20.)
  33. Event classification and summarization (continued). ROUGE-1 numbers usually change from dataset to dataset, but max out around 40-45% for 100-word system summaries; if we can achieve 10% of this for 10-word summaries, we are doing pretty good! Caveat: the text multi-document summarization task is much more complex than this simpler task. (Same classification setup and HEXTAC reference numbers as the previous slide.)
  34. Future directions for event classification and summarization. Typically lots of features help in classification, but do we need all of them for better summary generation? Does better event classification performance always mean better summarization performance? (Same pipeline, setup and reference numbers as the previous slides.)
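A minimal sketch of what a ROUGE-1 recall/precision computation looks like; the real evaluation used the official ROUGE toolkit, and this toy version skips stemming, stopword handling, and confidence intervals. The example strings are invented:

```python
from collections import Counter

def rouge1(reference, candidate):
    """Unigram-overlap recall and precision (a minimal ROUGE-1 sketch)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())   # clipped unigram matches
    return {"recall": overlap / max(sum(ref.values()), 1),
            "precision": overlap / max(sum(cand.values()), 1)}

reference = "a young man climbs an artificial rock wall indoors"
predicted = "man climb rock wall indoor gym jump hold grip climb"
print(rouge1(reference, predicted))
```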
  35. ROUGE-1 performance. MMLDA can show a poor ELBO, which is a bit misleading: it performs quite well on predicting summary-worthy keywords. MMGLDA produces better topics and a higher ELBO; the summary worthiness of its keywords is almost the same as MMLDA for lower n. Sum-normalizing the real-valued data to lie in [0,1]^P distorts reality for Corr-MMGLDA w.r.t. quantitative evaluation: the summary worthiness of its keywords is not good, but the topics are good; different but related topics can model GIST features almost equally (strong overlap in the tails of the Gaussians).
  36. ROUGE-1 performance: future directions. Need better initialization of the parameters/priors governing real-valued data [N. Nasios and A. G. Bors. Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(4):849-862, 2006]. (Same observations as the previous slide.)
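The sum-normalization mentioned above, as a two-line sketch: mapping each P-dimensional real-valued vector onto the simplex discards overall scale, which is one way such preprocessing can "distort reality". This is illustrative, not the authors' code:

```python
import numpy as np

gist = np.abs(np.random.default_rng(5).normal(size=(4, 16)))  # 4 clips, P = 16

# Sum-normalize each row so it lies in [0,1]^P (rows now sum to 1).
gist_norm = gist / gist.sum(axis=1, keepdims=True)

# The per-clip feature energy is gone after normalization:
print(gist.sum(axis=1).round(2), gist_norm.sum(axis=1))  # varied vs. all 1.0
```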
  37. Model usefulness and applications: label topics through document-level multimedia; movie recommendations through semantically related frames; video analysis (word prediction given video features); ad-word creation through the semantics of multimedia (using transcripts only can be noisy); semantic compression of videos; allowing the visually impaired to hear the world through text.
  38. Long list of acknowledgements. Scott McCloskey (Honeywell ACS Labs); Sangmin Oh, Amitha Perera (Kitware Inc.); Kevin Cannons, Arash Vahdat, Greg Mori (SFU): for helping us with feature extraction, event classification evaluations and many fruitful discussions throughout this project. Jack Gallant (UC Berkeley); Francisco Pereira (Siemens Corporate Research): for allowing us to reuse some of their illustrations in this presentation. Lucy Vanderwende (Microsoft Research); Enrique Alfonseca (Google Research): for helpful discussions during TAC 2011 on the importance of the summarization problem outside of the competitions on newswire collections.
  39. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20069. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government. We also thank the anonymous reviewers for their comments.
  40. Thanks!
