Translating Related Words to Videos and Back through Latent Topics

Principal Software Engineer at SmartFocus US, Inc.
Feb. 9, 2013

Editor's Notes

  1–2. The big-data problem: there is data all around us, but which of it is meaningful? We need statistics from the data that meaningfully encode multiple views, i.e. modalities. Sufficient statistics (functions of a sample that capture all the information the sample carries about the model parameters) usually represent the centers of the data.
  3. The centers are the topics: each corresponds to some best description of data items that are similar in some way. The true centers are never known; each one of us has an algorithm for finding centers, i.e. our own topic model.
  4. The actual ground-truth synopses overlaid on the training topics.
  5. BOLD (Blood Oxygen Level Dependent) and fMRI patterns. Images used with permission from Jack Gallant and Francisco Pereira (both of whom are now applying topic models to map brain patterns to movies or text).
  6. A genuine philanthropic use case
  7. The importance of relating multi-document summarization to video summarization: every frame is a document.
  8. Psycholinguistic studies would be needed to confirm this, but that is not a concern at this point. In our dataset we have only one ground-truth summary per video, which is the base case for ROUGE evaluation.
  9. Ground-truth annotation consists of complex, high-level descriptions. Spoken language is complicated; we put it in correspondence with a minimal set of features (next slide).
  10. Upper row: training data (camera motion and shake are a real problem for maintaining the bounding boxes). Lower row: trained models.
  11. Role of alpha: alpha governs the topic proportions from which every observation's topic is drawn, and it is a K-vector. Here each component of alpha is different (an asymmetric prior), which lets observations be apportioned differently across topics: e.g. one topic can focus solely on stop-words, another on commonly occurring words, and the remaining ones on the actual themes (see the sketch below).
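
A minimal sketch of that effect, assuming nothing beyond NumPy; the alpha values are illustrative, not those used in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Asymmetric Dirichlet prior over K = 4 topics: a large first component
# lets one topic soak up high-frequency words (e.g. stop-words), while
# the small components stay reserved for the actual themes.
alpha_asym = np.array([5.0, 0.1, 0.1, 0.1])
alpha_sym = np.full(4, 1.0)

theta_asym = rng.dirichlet(alpha_asym, size=5)  # per-document proportions
theta_sym = rng.dirichlet(alpha_sym, size=5)

print("asymmetric:\n", theta_asym.round(2))  # mass piles onto topic 0
print("symmetric:\n", theta_sym.round(2))    # mass spread more evenly
```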
  12. Translation formula (marginalization over topics). If there are two topics, i.e. K = 2, then (e.g. for the second term) 0.5·0.5 + 0.5·0.5 = 0.5 < 0·0.0001 + 0.9·0.9 = 0.81, so peaked topic distributions translate more confidently than flat ones. The values of the inferred φ's are very important for the real-valued data: well-separated Gaussians are better, but separation does not always happen. This raises the issue that the real-valued data may need to be preprocessed to increase the chances of separation. The marginalization is written out below.
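
A sketch of the marginalization being described, in generic notation (θ for the inferred topic proportions of the test video, φ_k for topic k's distribution over words); the exact conditioning in the paper may differ:

```latex
% Probability of emitting word w for a test video, marginalizing over topics:
% \theta_k = inferred proportion of topic k, \phi_{k,w} = P(w \mid \text{topic } k).
P(w \mid \text{video}) \;=\; \sum_{k=1}^{K} \theta_k \, \phi_{k,w}
% K = 2 example from the slide: flat topics give
% 0.5 \cdot 0.5 + 0.5 \cdot 0.5 = 0.5, while peaked topics give
% 0 \cdot 0.0001 + 0.9 \cdot 0.9 = 0.81, so peaked \phi's translate better.
```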
  13. Object Bank (computed on keyframes), HOG3D, and color histograms: the features through the lens of computer vision.
  14. Important references: http://cvcl.mit.edu/papers/oliva04.pdf | http://vision.stanford.edu/VSS2007-NaturalScene/Oliva_VSS07.pdf. Shown: the principal components of the spectrogram of real-world scenes. The spectrogram is sampled on a 4 × 4 spatial grid for better visualization; each subimage corresponds to the local energy spectrum at the corresponding spatial location. Global GIST patterns should differ between topics and sub-topics. Another relevant piece of information for image representation concerns the spatial relationships between the main structures in the image: the spatial distribution of spectral information can be described by means of the windowed Fourier transform (WFT), sketched below.
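
A minimal sketch of the windowed Fourier idea, assuming only NumPy and a grayscale image array; the published GIST descriptor uses Gabor filters at multiple scales and orientations, so this is an illustration, not that implementation:

```python
import numpy as np

def local_energy_spectra(img, grid=4):
    """Split a grayscale image into grid x grid windows and return the
    local Fourier energy spectrum of each window (a crude windowed
    Fourier transform, in the spirit of the GIST descriptor)."""
    h, w = img.shape
    bh, bw = h // grid, w // grid
    spectra = np.empty((grid, grid, bh, bw))
    for i in range(grid):
        for j in range(grid):
            block = img[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            # Energy spectrum = squared magnitude of the 2-D FFT,
            # shifted so the DC component sits in the center.
            spectra[i, j] = np.abs(np.fft.fftshift(np.fft.fft2(block))) ** 2
    return spectra

# Toy usage on a random "image"; each of the 16 subimages is the local
# energy spectrum at that spatial location.
img = np.random.default_rng(0).random((128, 128))
print(local_energy_spectra(img).shape)  # (4, 4, 32, 32)
```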
  15. A red arrow means the corresponding GIST property is lacking; green means it is present. As above, each subimage is the local energy spectrum at its spatial location; here the global GIST patterns are indeed different for topics and sub-topics.
  16. The dataset we use for the video summarization task was released as part of NIST's 2011 TRECVID Multimedia Event Detection (MED) evaluation. It is a collection of Internet multimedia content posted to various video hosting sites. The training set is organized into 15 event categories, some of which are: 1) attempting a board trick, 2) feeding an animal, 3) landing a fish, 4) wedding ceremony, 5) working on a woodworking project, etc. We use the videos and their textual metadata in all 15 events as training data: 2,062 clips with summaries, distributed almost equally among the events. The test set is the Transparent Development (Dev-T) collection, which includes positive instances of the first 5 training events and near-positive instances of the last 10, for a total of 630 videos labeled with event category information (and associated human synopses against which summarization performance is measured). Each summary is a short, very high-level description of the entire video, ranging from 2 to 40 words but averaging 10 words (including stopwords). We remove standard English stopwords and retain only the word morphologies (not strictly required) from the synopses as our training vocabularies (see the sketch below). The proportion of Dev-T videos belonging to events 6 through 15 is much lower than for the other events, since those clips are considered "related" instances that cover only part of the event category specification; our topic models are evaluated on those clips as well. The numbers of videos in events 6 through 15 of the Dev-T set are {4, 9, 5, 7, 8, 3, 3, 3, 10, 8}, while there are around 120 videos per event for the first 5 events. All other videos in the Dev-T set neither carry an event category label nor are identified as positive, negative, or related, and we do not consider them in our experiments.
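
A minimal sketch of that stopword/morphology step, assuming NLTK's English stopword list and Porter stemmer as stand-ins (the talk does not name the tools actually used):

```python
import re
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP = set(stopwords.words('english'))
stem = PorterStemmer().stem

def vocab_terms(synopsis):
    """Lowercase, drop stopwords, and keep only stemmed word forms."""
    tokens = re.findall(r"[a-z]+", synopsis.lower())
    return [stem(t) for t in tokens if t not in STOP]

print(vocab_terms("One lady is doing sewing project indoors."))
# e.g. ['one', 'ladi', 'sew', 'project', 'indoor']
```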
  17. There are no individual summaries for shots within a clip, only one high-level summary per clip. This may be precisely why shot-wise nearest-neighbor matching runs into problems.
  18. Why event-specific vocabularies?
  19. Modeling the correspondence of caption words to the main content, which can be annotated in various ways; the correspondence step is sketched below.
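
A sketch of the correspondence mechanism in the style of Blei and Jordan's Corr-LDA, which the correspondence models here build on; the notation is generic (N content observations with topic assignments z_n, M caption words w_m), not the paper's exact variables:

```latex
% Each caption word first picks one of the N content observations
% uniformly, then is generated from that observation's topic:
y_m \sim \mathrm{Uniform}(1, \dots, N), \qquad
w_m \sim \mathrm{Mult}(\phi_{z_{y_m}})
% This soft constraint ties caption words to topics that actually
% explain some observed content, rather than to arbitrary topics.
```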
  20. “Dear Wikipedia readers: We are the small non-profit that runs the #5 website in the world. We have only 150 staff but serve 450 million users.” Why might that be? Both the main and the embedded content reflect coherent topics; if an irrelevant advertisement appeared, the topic would drift and Wikipedia would lose its appeal.
  21. Corr-MMGLDA seems to capture more variance than MMGLDA; α for Corr-MMGLDA is thus slightly higher than for MMGLDA. The topic parameters over words are seeded from documents during initialization and are therefore the same for both models here.
  22. This is a tough event for matching words with frames: “Working on a sewing project.” Top row: frames coming from only one video (we do not impose a constraint of selecting only 5 frames per video, although this could easily be done); the shown video's actual synopsis is “One lady is doing sewing project indoors.” Bottom rows: better variance; note how they capture a dad sewing a kid's penguin with needle and thread. First row: “Woman demonstrating different stitches using a serger/sewing machine.” Second row: “dad sewing up stuffed penguin for kids.” Third row: “Woman makes a bordered hem skirt.” Last one: “A pair of hands do a sewing project using a sewing machine.” Other features might help: actions and objects, since GIST and color may not be enough.
  23. This is again a tough event for matching words with frames: “Repairing an appliance.” Top row: frames coming from only one video; a bad example. The shown video's actual synopsis is “How to repair the water level control mechanism on a Whirlpool washing machine.” Bottom rows: better variance. Row 1, cols 1–3: “a man is repairing a whirlpool washer”; row 1, col 4: “how to remove blockage from a washing machine pump”; row 2, cols 1–3: “Woman demonstrates replacing a door hinge on a dishwasher”; row 2, col 4: “A guy shows how to make repairs on a microwave”; row 3, cols 1–3: “How to fix a broken agitator on a Whirlpool washing machine”; row 3, col 4: “A guy working on a vintage box fan.” Other features might help: actions and objects, since GIST and color are not enough.
  24. ROUGE scores usually vary from dataset to dataset, but the maximum is around 40–45% for 100-word summaries. If we can achieve 10% of that for 10-word summaries, we are doing pretty well; a toy computation follows.
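
A toy ROUGE-1 recall computation, as a sketch only (the official ROUGE toolkit adds stemming, stopword options, and bootstrap resampling):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Unigram overlap between candidate and reference summaries,
    divided by the reference length (ROUGE-1 recall)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

# Hypothetical 10-word system output vs. a single ground-truth synopsis.
print(rouge1_recall(
    "woman sewing machine project indoors table thread needle fabric demo",
    "one lady is doing sewing project indoors"))  # 3/7 ~= 0.43
```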
  25–26. Caveat: the text multi-document summarization task is much more complex than this simpler task (with respect to summarization).
  27. Purely multinomial topic models showing lower ELBOs can still perform quite well in bag-of-words summarization. MMLDA assigns likelihoods based on the success and failure of independent events, and failures contribute large negative terms to the log likelihood; but this does not reflect the model's summarization performance, where low-probability terms are pruned out anyway. Gaussian components can partially remove the independence through covariance modeling, but this can also allow different-but-related topics to model GIST features almost equally (strong overlap in the tails of the bell-shaped Gaussians), producing a poor permutation of predicted words because the soft probabilistic constraint of correspondence is violated. The likelihood decomposition below makes the first point concrete.
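
A sketch of why that happens, in generic notation (n_w = count of word w, θ the topic proportions, φ_k the topic-word distributions); this is the standard bag-of-words log likelihood rather than the exact ELBO terms of the paper:

```latex
% Bag-of-words log likelihood of a document:
\log p(\mathbf{w} \mid \theta, \phi)
  \;=\; \sum_{w} n_w \log \Big( \sum_{k=1}^{K} \theta_k \, \phi_{k,w} \Big)
% Words with tiny marginal probability \sum_k \theta_k \phi_{k,w}
% contribute large negative terms, dragging the bound down; yet a
% summarizer that keeps only the highest-probability words never
% emits those terms, so a lower ELBO need not mean worse summaries.
```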
  28. Some work exists on initializing priors in the Gaussian Mixture Model (GMM) setting, but none on the effects of such initializations for topic models that combine Gaussians and multinomials; a common GMM-style starting point is sketched below.
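
A minimal sketch of the GMM-style initialization being referred to, assuming scikit-learn; the k-means seeding is illustrative, not prescribed by the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_gaussian_topics(X, K, seed=0):
    """Seed per-topic Gaussian parameters for real-valued features
    (e.g. GIST vectors) from a k-means clustering, as is common
    practice for GMMs."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
    means = km.cluster_centers_
    # One diagonal covariance per topic, from the points in its cluster.
    covs = np.stack([
        np.var(X[km.labels_ == k], axis=0) + 1e-6  # floor for stability
        for k in range(K)
    ])
    return means, covs

X = np.random.default_rng(0).random((500, 32))  # toy feature matrix
means, covs = init_gaussian_topics(X, K=5)
print(means.shape, covs.shape)  # (5, 32) (5, 32)
```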
  29. Never had the chance to acknowledge them all in the paper