Translating Related Words to Videos and Back through Latent Topics
Speaker Notes

  • Big data problem: lots of data around us, but which of it is meaningful? We need statistics from the data that meaningfully encode multiple views, i.e. modalities. Sufficient statistics (i.e. functions of a sample that encode all information about the sample) usually represent the centers of the data.
  • Centers are the topics, which correspond to some best description of data that are similar in some way. True centers are never known: each one of us has an algorithm for finding centers, our own topic model.
  • The actual ground-truth synopses overlaid on the training topics.
  • BOLD (Blood Oxygen Level Dependent) and fMRI patterns. Images used with permission from Jack Gallant and Francisco Pereira (by the way, both of them are now applying topic models to map brain patterns to movies or text).
  • A genuine philanthropic use case
  • The importance of relating multi-document summarization to video summarization: every frame is a document.
  • Psycholinguistic studies would be needed to confirm this, but that is not a concern at this point. In our dataset we have only one ground-truth summary per video: the base case for ROUGE evaluation.
  • Ground-truth annotation: complex, high-level descriptions. Spoken language is complicated; we put it in correspondence with a minimal set of features (next).
  • Upper row: training (camera motion and shakes are a real problem for maintaining the bounding boxes). Lower row: trained models.
  • Role of alpha: alpha is a K-vector that governs the topic proportions from which every observation draws its topic. Here each component of alpha is different, which helps assign observations to topics in different proportions (e.g. one topic can focus solely on stop-words, another on commonly occurring words, and the others on the actual content topics). A minimal sampling sketch follows.
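As a minimal illustration of the asymmetric prior (the alpha values below are purely illustrative, not the ones inferred in the paper), Python/NumPy makes the contrast with a symmetric prior easy to see:

    import numpy as np

    rng = np.random.default_rng(0)

    # Symmetric prior: on average every topic receives comparable mass.
    sym = rng.dirichlet(np.full(4, 1.0), size=3)

    # Asymmetric prior (hypothetical values): the two large components can
    # soak up stop-words and commonly occurring words, leaving the small
    # components free to model the actual content topics.
    asym = rng.dirichlet(np.array([5.0, 2.0, 0.5, 0.5]), size=3)

    print(sym.round(2))
    print(asym.round(2))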
  • Translation formula (marginalization over topics). If there are two topics, i.e. K = 2, then (e.g. for the second term) 0.5·0.5 + 0.5·0.5 = 0.5, which is less than 0·0.0001 + 0.9·0.9 = 0.81: peaked topic responsibilities produce sharper translations. The values of the inferred φ's are very important for the real-valued data: well-separated Gaussians are better, but separation does not always happen. This raises the issue that the real-valued data may need to be preprocessed to increase the chances of separation.
  • Object Bank (Computed on keyframes), HOG3D and Color histograms – features through the lens of computer vision
  • Important references: http://cvcl.mit.edu/papers/oliva04.pdf | http://vision.stanford.edu/VSS2007-NaturalScene/Oliva_VSS07.pdf. The principal components of the spectrogram of real-world scenes: the spectrogram is sampled at 4 × 4 spatial locations for better visualization, and each subimage corresponds to the local energy spectrum at the corresponding spatial location. Global GIST patterns should be different for topics and sub-topics. Another relevant piece of information for image representation concerns the spatial relationships between the main structures in the image: the spatial distribution of spectral information can be described by means of the windowed Fourier transform (WFT).
  • A red arrow means "lack of the corresponding GIST property" and green means it is present. The principal components of the spectrogram of real-world scenes: the spectrogram is sampled at 4 × 4 spatial locations for better visualization, and each subimage corresponds to the local energy spectrum at the corresponding spatial location. Global GIST patterns are different for topics and sub-topics. Another relevant piece of information for image representation concerns the spatial relationships between the main structures in the image: the spatial distribution of spectral information can be described by means of the windowed Fourier transform (WFT).
  • The dataset that we use for the video summarization task is released as part of NIST's 2011 TRECVID Multimedia Event Detection (MED) evaluation set. The dataset consists of a collection of Internet multimedia content posted to various Internet video-hosting sites. The training set is organized into 15 event categories, some of which are: 1) Attempting a board trick, 2) Feeding an animal, 3) Landing a fish, 4) Wedding ceremony, 5) Working on a woodworking project, etc. We use the videos and their textual metadata in all 15 events as training data. There are 2062 clips with summaries in the training set, distributed almost equally amongst the events. The test set which we use is called the Transparent Development (Dev-T) collection. The Dev-T collection includes positive instances of the first 5 training events and near-positive instances for the last 10 events: a total of 630 videos labeled with event category information (and associated human synopses against which summarization performance is compared). Each summary is a short and very high-level description of the entire video and ranges from 2 to 40 words, but is on average 10 words (with stopwords). We remove standard English stopwords and retain only the word morphologies (though this is not required) from the synopses as our training vocabularies. The proportion of videos belonging to events 6 through 15 in the Dev-T set is much lower than the proportion for the other events, since those clips are considered to be "related" instances which cover only part of the event category specifications. The performance of our topic models is evaluated on those kinds of clips as well. The numbers of videos in events 6 through 15 in the Dev-T set are {4, 9, 5, 7, 8, 3, 3, 3, 10, 8}, while there are around 120 videos per event for the first 5 events. All other videos in the Dev-T set neither have any event category label nor are identified as positive, negative or related videos, and we do not consider them in our experiments.
  • There are no individual summaries for shots within a clip – only one high-level summary. Are the problems with shot-wise nearest-neighbor matching due precisely to this?
  • Why event-specific vocabularies?
  • Modeling correspondence of caption words to the main text content which can be annotated in various ways
  • "Dear Wikipedia readers: We are the small non-profit that runs the #5 website in the world. We have only 150 staff but serve 450 million users" – finding the reason why this might be so. Both the main and the embedded content reflect coherent topics; if an irrelevant advertisement appears, the topic will drift and Wikipedia will lose its appeal.
  • Corr-MMGLDA seems to capture more variance relative to MMGLDA; α for Corr-MMGLDA is thus slightly higher than that for MMGLDA. Topic parameters over words are seeded through documents during initialization and hence are the same for both models here.
  • This is a tough event for matching words with frames; the event is "Working on a sewing project". Top row: frames coming from only one video. We do not impose a constraint that only 5 frames per video may be selected, although this can easily be done. The shown video's actual synopsis is "One lady is doing sewing project indoors." Bottom row: better variance; note how it captures dad sewing the kid's penguin with a needle and thread. First row: "Woman demonstrating different stitches using a serger/sewing machine"; second row: "dad sewing up stuffed penguin for kids"; third row: "Woman makes a bordered hem skirt."; last one: "A pair of hands do a sewing project using a sewing machine." Other features might help: action and objects; GIST and color may not be enough.
  • This is again another tough event for matching words with frames; the event is "Repairing an appliance". Top row: frames coming from only one video; a bad example. The shown video's actual synopsis is "How to repair the water level control mechanism on a Whirlpool washing machine." Bottom row: better variance. Row 1, cols 1-3: "a man is repairing a whirlpool washer"; row 1, col 4: "how to remove blockage from a washing machine pump"; row 2, cols 1-3: "Woman demonstrates replacing a door hinge on a dishwasher"; row 2, col 4: "A guy shows how to make repairs on a microwave"; row 3, cols 1-3: "How to fix a broken agitator on a Whirlpool washing machine"; row 3, col 4: "A guy working on a vintage box fan". Other features might help: action and objects; GIST and color are not enough.
  • ROUGE scores usually change from dataset to dataset, but max out around 40-45% for 100-word summaries. If we can achieve 10% of this for 10-word summaries, we are doing pretty well! (A simplified sketch of the metric follows.)
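As a rough sketch of what ROUGE-1 recall measures in the single-reference setting used here (simplified: the official toolkit also handles count clipping, stemming and stopword options):

    def rouge1_recall(predicted_keywords, reference):
        # Fraction of reference unigrams that also appear among the predictions.
        ref_tokens = reference.lower().split()
        pred_set = {w.lower() for w in predicted_keywords}
        matched = sum(1 for tok in ref_tokens if tok in pred_set)
        return matched / len(ref_tokens)

    # Hypothetical example: 10 predicted keywords vs. one short reference synopsis.
    pred = "man climb rock wall indoor gym boulder artificial hold chalk".split()
    print(rouge1_recall(pred, "a man is bouldering at an indoor rock climbing gym"))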
  • Caveat: the text multi-document summarization task is much more complex (w.r.t. summarization) than this simpler keyword-prediction task.
  • Purely multinomial topic models showing lower ELBOs can perform quite well in BoW summarization. MMLDA assigns likelihoods based on successes and failures of independent events, and failures contribute highly negative terms to the log-likelihoods, but this does not indicate the model's summarization performance, where low-probability terms are pruned out. Gaussian components can partially remove the independence through covariance modeling, but this can also allow different but related topics to model GIST features almost equally (strong overlap in the tails of the bell-shaped Gaussian curves) and show a poor permutation of predicted words due to violation of the soft probabilistic constraint of correspondence.
  • There has been some work on the initialization of priors in the Gaussian mixture model (GMM) setting, but no work on the effects of such initializations for topic models involving both Gaussians and multinomials.
  • Never had the chance to acknowledge them all in the paper

Translating Related Words to Videos and Back through Latent Topics Presentation Transcript

  • 1. Translating Related Words to Videos and Back through Latent Topics. Pradipto Das, Rohini K. Srihari and Jason J. Corso, SUNY Buffalo. WSDM 2013, Rome, Italy.
  • 2. WiSDoM is beyond words. "Master Yoda, how do I find wisdom from so many things happening around us?" "Go to the center of the data and find your wisdom you will."
  • 3. WiSDoM is beyond words. "Master Yoda, how do I find wisdom from so many things happening around us?" "Go to the center of the data and find your wisdom you will."
  • 4. What do the centers look like? [Slide shows the top words of four latent topics, interleaved in the original extraction: an outdoor parkour topic (parkour, perform, traceur, area, flip, footage, jump, park, urban, run, outdoor, outdoors, kid, group, pedestrian, playground, ...), a cooking topic (lobster, burger, dress, celery, Christmas, wrap, roll, mix, tarragon, steam, season, scratch, water, lemon, garlic, ...), an indoor parkour topic (floor, wall, jump, handrail, locker, contestant, school, run, interview, block, slide, indoor, perform, build, ...), and a sandwich-making topic (make, dog, sandwich, man, outdoors, guy, bench, sit, park, feed, contest, parody, ...).] Captions: "Interviews indoors can be tough!" "Be careful on what people do with their sandwiches!"
  • 5. The actual ground-truth synopses overlaid on the topics: "Man performs parkour in various locations"; "Kid does parkour around the park"; "A family holds a strange burger assembly and wrapping contest at Christmas"; "Footage of a group of guys performing parkour outdoors; free running montage of jumping up a tree and through the woods"; "tutorial: man explains how to make lobster rolls from scratch"; "interview with parkour contestants"; "One guy is making a sandwich outdoors". [Same four topic word lists as the previous slide.] Captions: "Interviews indoors can be tough!" "Be careful on what people do with their sandwiches!"
  • 6. Back to conventional wisdom: Translation. [S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu and J. Gallant, "Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies," Current Biology 21(19), 2011.] There is some model that captures the correspondence of the blood-flow patterns in the brain to the world being observed. Given a slightly different pattern, we are able to translate it to concepts present in our vocabulary, i.e. to a lingual description. Three basic assumptions of machine learning are satisfied: 1) there is a pattern; 2) we do not know the target function; 3) there is data to learn from. Training/testing pipeline: Topic Model (LDA) and Regression. [F. Pereira, G. Detre and M. Botvinick, "Generating text from functional brain images," Frontiers in Human Neuroscience 5(72), 2011.]
  • 7. Back to conventional wisdom: Translation (continued). Giving back to the community: driverless cars are already helping the visually impaired to drive around; it will be good to enable visually impaired drivers to hear the scenery in front. [Same translation slide as above, with the Nishimoto et al. and Pereira et al. references.]
  • 8. Do we speak all that we see? Multiple human summaries (max 10 words, i.e. imposing a length constraint): 1. There is a guy climbing on a rock-climbing wall. 2. A man is bouldering at an indoor rock climbing gym. 3. Someone doing indoor rock climbing. 4. A person is practicing indoor rock climbing. 5. A man is doing artificial rock climbing.
  • 9. Centers of attention (topics). Not so important: the hand holding the climbing surface; how many rocks there are; the sketch on the board; the wrist-watch; what's in the back; the dress of the climber; empty slots; the color of the floor. Multiple human summaries (max 10 words, as above): 1. There is a guy climbing on a rock-climbing wall. 2. A man is bouldering at an indoor rock climbing gym. 3. Someone doing indoor rock climbing. 4. A person is practicing indoor rock climbing. 5. A man is doing artificial rock climbing. Summaries point toward information needs!
  • 10. From patterns to topics to sentences. "A young man climbs an artificial rock wall indoors": direct subject (man), direct object (wall), adjective modifier (what kind of wall? artificial rock), adverb modifier (climbing where? indoors). Spoken language is complex: structured according to various grammars and dependent on active topics. Different paraphrases describe the same visual input. Major topic: rock climbing. Sub-topics: artificial rock wall, indoor rock climbing gym.
  • 11. Object detection models. Annotations for training object/concept models: expensive frame-wise manual annotation efforts by drawing bounding boxes; difficulties with camera shakes, camera motion and zooming; careful consideration of which objects/concepts to annotate. A focus on object/concept detection (e.g. trained models for "man with microphone", "climbing person") is noisy for videos in the wild, and does not answer which objects/concepts are important for summary generation.
  • 12. Translating across modalities. Learning latent translation spaces, a.k.a. topics: mixed membership of latent topics; some topics capture observations that commonly co-occur; other topics allow for discrimination; different topics can be responsible for different modalities. No annotations needed: only a clip-level summary. Human synopsis: "A young man is climbing an artificial rock wall indoors."
  • 13. Translating across modalities. Using learnt translation spaces for prediction: topics are marginalized out to permute the vocabulary for predictions; the lower the correlation among topics, the better the permutation; sensitive to priors for real-valued data. Text translation:
    p(w_v \mid \mathbf{w}_O, \mathbf{w}_H) \propto \sum_{o=1}^{O}\sum_{i=1}^{K} \phi^{(O)}_{d,o,i}\, p(w_v \mid i) + \sum_{h=1}^{H}\sum_{i=1}^{K} \phi^{(H)}_{d,h,i}\, p(w_v \mid i)
  • 14. Translating across modalities (annotated). The same prediction formula, with its terms labeled: \phi^{(O)}_{d,o,i} is the responsibility of topic i over real-valued observations; \phi^{(H)}_{d,h,i} is the responsibility of topic i over discrete video features; p(w_v \mid i) is the probability of learnt topic i explaining word w_v in the text vocabulary:
    p(w_v \mid \mathbf{w}_O, \mathbf{w}_H) \propto \sum_{o=1}^{O}\sum_{i=1}^{K} \phi^{(O)}_{d,o,i}\, p(w_v \mid i) + \sum_{h=1}^{H}\sum_{i=1}^{K} \phi^{(H)}_{d,h,i}\, p(w_v \mid i)
    A small sketch of this computation follows.
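A small NumPy sketch of this marginalization (all array names and sizes are placeholders; in the paper the φ responsibilities come from variational inference, here they are random stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, O, H = 10, 5000, 20, 300   # topics, text vocabulary, real-valued and discrete observations

    beta = rng.dirichlet(np.ones(V), size=K)    # p(w_v | i): K x V topic-word probabilities
    phi_O = rng.dirichlet(np.ones(K), size=O)   # topic responsibilities over real-valued (GIST) observations
    phi_H = rng.dirichlet(np.ones(K), size=H)   # topic responsibilities over discrete video "words"

    # Marginalize the topics out: score every text word, then read off the
    # top 10 keywords as the predicted translation of the video.
    scores = (phi_O.sum(axis=0) + phi_H.sum(axis=0)) @ beta   # length-V vector
    top10 = np.argsort(scores)[::-1][:10]
    print(top10)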
  • 15. Wisdom of the young padawans. OB (Object Bank): high-level semantic representation of images from low-level features [L-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010]. HOG3D (histogram of oriented gradients in 3D): effective action-recognition features for videos [A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008]. Color histogram: 512 RGB color bins; histograms are computed on densely sampled frames; large deviations in the extremities of the color spectrum are discarded. (A sketch of one plausible binning follows.)
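One plausible reading of the 512-bin histogram (8 quantization levels per RGB channel, 8^3 = 512 bins; the exact binning and the rule for discarding the extremities of the spectrum are assumptions here):

    import numpy as np

    def color_histogram_512(frame, trim=2):
        # frame: H x W x 3 uint8 RGB image; 256/32 = 8 levels per channel.
        q = frame.astype(np.int32) // 32
        bins = q[..., 0] * 64 + q[..., 1] * 8 + q[..., 2]   # indices 0..511
        hist = np.bincount(bins.ravel(), minlength=512).astype(float)
        hist[:trim] = 0.0    # drop near-black bins (assumed discard rule)
        hist[-trim:] = 0.0   # drop near-white bins (assumed discard rule)
        return hist / max(hist.sum(), 1.0)

    frame = np.random.default_rng(0).integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
    print(color_histogram_512(frame).shape)   # (512,)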
  • 16. Wisdom of the young padawans. Scenes from images belonging to different topics and sub-topics. Topic "town hall meeting": "The video is about a man answering a question from the podium using a microphone"; "Two camera men film a cop taking a camera from a woman sitting in a group". Topic "rock climbing", with sub-topics: "A young man climbs an artificial rock wall indoors"; "A man climbs a boulder outdoors with a friend spotting".
  • 17. Wisdom of the young padawans. Global GIST energy [A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145-175, 2001]. Eight perceptual dimensions capture most of the 3D structure of real-world scenes: naturalness, openness, perspective or expansion, size or roughness, ruggedness, mean depth, symmetry and complexity. GIST in general terms: an energy space that pervades the arrangements of objects; does not really care about the specificity of the objects; helps us summarize an image even after it has disappeared from our sight. (A simplified sketch of the underlying windowed-Fourier energy follows.)
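A much-simplified sketch of the windowed-Fourier idea behind GIST (local energy spectra on a 4 × 4 spatial grid; the actual Oliva-Torralba descriptor uses banks of oriented filters and dimensionality reduction, which are omitted here):

    import numpy as np

    def local_energy_spectra(gray, grid=4):
        # gray: H x W float image; mean Fourier energy per cell of a grid x grid layout.
        H, W = gray.shape
        bh, bw = H // grid, W // grid
        window = np.hanning(bh)[:, None] * np.hanning(bw)[None, :]
        energy = np.empty((grid, grid))
        for r in range(grid):
            for c in range(grid):
                block = gray[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
                energy[r, c] = (np.abs(np.fft.fft2(block * window)) ** 2).mean()
        return energy

    img = np.random.default_rng(0).random((128, 128))
    print(local_energy_spectra(img).round(1))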
  • 18. Yoda's wisdom. It will be nice to have the Force as a "feature"! "For my ally is the Force. Its energy surrounds us and binds us. Luminous beings are we, not this crude matter." [Same example scenes and captions as slide 16.]
  • 19. Datasets. NIST's 2011 TRECVID Multimedia Event Detection (MED) Events and Dev-T datasets. The training set is organized into 15 event categories: 1) Attempting a board trick, 2) Feeding an animal, 3) Landing a fish, 4) Wedding ceremony, 5) Woodworking project, 6) Birthday party, 7) Changing a vehicle tire, 8) Flash mob gathering, 9) Getting a vehicle unstuck, 10) Grooming an animal, 11) Making a sandwich, 12) Parade, 13) Parkour, 14) Repairing an appliance, 15) Working on a sewing project. Each video has its own high-level summary, varying from 2 to 40 words but on average 10 words. There are 2062 clips in the training set and 530 clips for the first 5 events in the Dev-T set. Dev-T summaries are only used as reference summaries for evaluation against up to 10 predicted keywords.
  • 20. The summarization perspective. Diagram: sub-events (e.g. skateboarding, snowboarding, surfing); multiple sets of documents (sets of frames in videos); multiple sentences (groups of segments in frames). A multimedia topic model permutes event-specific vocabularies (skateboarding, wedding ceremony, feeding animals, woodworking project, landing fishes) to produce bag-of-keywords multi-document summaries and, eventually, natural-language multi-document summaries.
  • 21. The summarization perspective: why event-specific vocabularies? For the actual synopsis "man feeds fish bread" (event: feeding animals), two models predict (top 10 words):
    - One school of thought: fish jump bread fishing skateboard pole machine car dog cat
    - Another school of thought: shampoo sit condiment place bread fill plate jump pole fishing
    Intuitively, multiple objects and actions are shared across events and many different words get semantically associated; prediction quality degenerates rapidly!
  • 22. Previously [P. Das, R. K. Srihari and Y. Fu. "Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives," CIKM, Glasgow, Scotland, 2011]: words forming other Wiki articles; article-specific content words; words corresponding to the embedded multimedia.
  • 23. Afterwards [P. Das, R. K. Srihari and J. J. Corso. "Translating Related Words to Videos and Back through Latent Topics," WSDM, Rome, Italy, 2013]: words forming other Wiki articles; article-specific content words; words corresponding to the embedded multimedia.
  • 24. The family of multimedia topic models. Corr-MMGLDA: if a single topic generates a scene, the same topic generates all text in the document; a considerable strong point, but a drawback for summary generation when this is not the case. MMGLDA: more diffuse translation of both visual and textual patterns through the latent translation spaces; intuitively it aids frequency-based summarization. The key is to use an asymmetric Dirichlet prior. Plate-diagram components: document-specific topic proportions; indicator variables; synopsis words; GIST features; visual "words"; topic parameters for explaining latent structure within observation ensembles. (A toy sketch of the generative story follows.)
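A toy sketch of the MMGLDA-style generative story implied by the plate diagram (all sizes and parameter values are illustrative; here each modality draws its own topic indicator from the shared document proportions, whereas Corr-MMGLDA would instead tie the word topics to those already used for the visual observations):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, D = 10, 5000, 16                           # topics, text vocabulary, GIST dimensionality (toy)
    alpha = np.array([5.0, 2.0] + [0.5] * (K - 2))   # asymmetric Dirichlet prior
    beta = rng.dirichlet(np.ones(V), size=K)         # per-topic multinomials over synopsis words
    mu = rng.normal(size=(K, D))                     # per-topic Gaussian means over GIST features

    def generate_video_document(n_words=10, n_gist=20):
        theta = rng.dirichlet(alpha)                 # document-specific topic proportions
        words = [rng.choice(V, p=beta[rng.choice(K, p=theta)]) for _ in range(n_words)]
        gists = [rng.normal(mu[rng.choice(K, p=theta)], 1.0) for _ in range(n_gist)]
        return words, gists

    words, gists = generate_video_document()
    print(words, len(gists))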
  • 25. Topic modeling performance. Test ELBOs on events 1-5 in the Dev-T set: measuring held-out log-likelihoods on both videos and associated human summaries. Prediction ELBOs on events 1-5 in the Dev-T set: measuring held-out log-likelihoods on videos alone, in the absence of the text. In a purely multinomial MMLDA model, failures of independent events contribute highly negative terms to the log-likelihoods: clearly NOT a measure of keyword summary generation power. For the MMGLDA family of models, Gaussian components can partially remove the independence through covariance modeling; this allows only the responsible topic-Gaussians to contribute to the likelihood.
  • 26. Translating Related Words to Videos: Corr-MMGLDA vs. MMGLDA, per-topic statistics (topics 1-10):

    Topic                     1        2        3        4        5        6        7        8        9        10
    Corr-MMGLDA α             0.445936 0.451391 0.462443 0.397392 0.374922 0.573839 0.425912 0.375423 0.38186  0.189047
    MMGLDA α                  0.414354 0.422954 0.427442 0.359592 0.353317 0.552872 0.39681  0.349695 0.345466 0.163971
    Corr-MMGLDA log(α/|Λ|)    12.6479  61.7312  50.0512  58.7659  60.1194  104.628  28.2949  31.3856  18.9223  8.164
    MMGLDA log(α/|Λ|)         12.498   61.4666  49.8858  58.643   59.9248  104.623  28.2264  31.2219  18.6953  8.1025
  • 27. Translating Related Words to Videos (continued). Corr-MMGLDA is able to capture more variance relative to MMGLDA; α for Corr-MMGLDA is also slightly higher than that for MMGLDA; this can allow related but topically unique concepts to appear upfront. [Same per-topic α and log(α/|Λ|) table as the previous slide.]
  • 28. Related Words to Videos – difficult examples (event: working on a sewing project). Predicted word sets alongside example frames: "measure project lady tape indoor sew marker pleat highwaist zigzag scissor card mark teach cut fold stitch pin woman skirt machine fabric inside scissors make leather kilt man beltloop"; "sew woman fabric make machine show baby traditional loom blouse outdoors blanket quick rectangle hood knit indoor stitch scissors pin cut iron studio montage measure kid penguin dad stuff thread".
  • 29. Related Words to Videos – difficult examples (event: repairing an appliance). Predicted word sets alongside example frames: "clock mechanism repair computer tube wash machine lapse click desk mouse time front wd40 pliers reattach knob make level video water control person clip part wire inside indoor whirlpool man gear machine"; "guy repair sew fan test make replace grease vintage motor box indoor man tutorial fuse bypass brush wrench repairman lubricate workshop bottom remove screw unscrew screwdriver video wire".
  • 30. A few words are worth a thousand frames! (From MMGLDA.)
  • 31. A few words are worth a thousand frames! (From MMGLDA.)
  • 32. Event classification and summarization. A c-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification; 55% test accuracy is easily achievable. Evaluate using ROUGE-1. For reference, HEXTAC 2009 (100-word human references vs. 100-word manually extracted summaries): average recall 0.37916 (95% confidence interval 0.37187 - 0.38661); average precision 0.39142 (95% confidence interval 0.38342 - 0.39923). [Same summarization-pipeline diagram as slide 20.] (A sketch of the classification step follows.)
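A sketch of that classification step under stated assumptions (the slide reports a c-SVM from the libSVM package with default settings; scikit-learn's SVC, which wraps libSVM, stands in here, and the feature matrices are random placeholders for the per-video descriptors):

    import numpy as np
    from sklearn.svm import SVC   # SVC wraps libSVM; defaults give a C-SVM with an RBF kernel

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(2062, 512))     # placeholder per-video feature vectors
    y_train = rng.integers(0, 15, size=2062)   # 15 event-category labels
    X_test = rng.normal(size=(530, 512))
    y_test = rng.integers(0, 15, size=530)

    clf = SVC()                                # default settings, one-vs-one multiclass
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))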
  • 33. Event classification and summarization (continued). ROUGE-1 scores usually change from dataset to dataset, but max out around 40-45% for 100-word system summaries; if we can achieve 10% of this for 10-word summaries, we are doing pretty well. Caveat: the text multi-document summarization task is much more complex than this simpler task (w.r.t. summarization). [Same diagram and HEXTAC 2009 reference numbers as the previous slide.]
  • 34. Future directions. Typically lots of features help in classification, but do we need all of them for better summary generation? Does better event classification performance always mean better summarization performance? [Same diagram, ROUGE-1 notes and HEXTAC 2009 reference numbers as the previous slides.]
  • 35. ROUGE-1 performance. MMLDA can show a poor ELBO – a bit misleading: it performs quite well on predicting summary-worthy keywords. MMGLDA produces better topics and a higher ELBO; the summary-worthiness of its keywords is almost the same as MMLDA's for lower n. Sum-normalizing the real-valued data to lie in [0,1]^P distorts reality for Corr-MMGLDA w.r.t. quantitative evaluation: the summary-worthiness of its keywords is not good, but the topics are good; different but related topics can model GIST features almost equally (strong overlap in the tails of the Gaussians).
  • 36. ROUGE-1 performance – future directions. Need better initialization of the priors governing parameters for real-valued data [N. Nasios and A. G. Bors. Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(4):849-862, 2006]. [Same ROUGE-1 observations as the previous slide.]
  • 37. Model usefulness and applications: label topics through document-level multimedia; movie recommendations through semantically related frames; video analysis: word prediction given video features; adword creation through the semantics of multimedia (using transcripts only can be noisy); semantic compression of videos; allowing the visually impaired to hear the world through text.
  • 38. Long list of acknowledgements. Scott McCloskey (Honeywell ACS Labs); Sangmin Oh, Amitha Perera (Kitware Inc.); Kevin Cannons, Arash Vahdat, Greg Mori (SFU): for helping us with feature extraction, event classification evaluations and many fruitful discussions throughout this project. Jack Gallant (UC Berkeley), Francisco Pereira (Siemens Corporate Research): for allowing us to reuse some of their illustrations in this presentation. Lucy Vanderwende (Microsoft Research), Enrique Alfonseca (Google Research): for helpful discussions during TAC 2011 on the importance of the summarization problem outside of the competitions on newswire collections.
  • 39. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20069. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government. We also thank the anonymous reviewers for their comments.
  • 40. Thanks!