
A Multimodal Approach for Video Geocoding


  1. A Multimodal Approach for Video Geocoding (UNICAMP at Placing Task, MediaEval 2012). Lin Tzy Li, Jurandy Almeida, Daniel Carlos Guimarães Pedronette, Otávio A. B. Penatti, and Ricardo da S. Torres. Institute of Computing, University of Campinas (UNICAMP), Brazil.
  2. Multimodal geocoding proposal
  3. Textual features
     • Similarity functions: Okapi & Dice
     • Video metadata (run 1)
       – Title + description + keywords (Okapi_all)
       – Only description: Okapi_desc & Dice_desc
       – Combined result in run 1: Okapi_all + Okapi_desc + Dice_desc
     • Photo tags (run 5)
       – Okapi function: keywords (test video) matched against tags (3,185,258 Flickr photos)
     A minimal sketch of the Dice similarity follows below.
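Okapi (BM25) requires corpus-wide statistics such as document frequencies, so only the set-based Dice coefficient is sketched here; the function name and the toy metadata terms are illustrative assumptions, not the authors' implementation.

```python
def dice_similarity(terms_a: set, terms_b: set) -> float:
    """Dice coefficient between two term sets: 2|A ∩ B| / (|A| + |B|)."""
    if not terms_a or not terms_b:
        return 0.0
    return 2.0 * len(terms_a & terms_b) / (len(terms_a) + len(terms_b))

# Toy example: description terms of a test video vs. a development video.
desc_test = {"sunset", "beach", "rio", "brazil"}
desc_dev = {"beach", "copacabana", "rio"}
print(dice_similarity(desc_test, desc_dev))  # 2*2 / (4+3) ≈ 0.571
```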
  4. Geocoding visual content
     • Pipeline: a test video goes through feature extraction, then similarity computation against the geocoded development set (15,563 videos with lat/long, tags, title, description, etc.), producing a ranked list of similar videos (V1 … Vk) with candidate lat/long and match scores.
     • The location (lat/long) of the most similar video is used as the candidate location for the test video (see the sketch below).
     • Visual features: Bag of Scenes (built from photos) and Histograms of Motion Patterns.
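A minimal sketch of the final step of this pipeline, assuming similarity scores against the development set have already been computed; the function and variable names are hypothetical, and the simple top-1 policy follows the slide's description.

```python
def geocode_by_top_match(scores: dict, dev_locations: dict) -> tuple:
    """Return the (lat, long) of the development-set video with the
    highest match score, used as the candidate location for the test video."""
    best_video = max(scores, key=scores.get)
    return dev_locations[best_video]

# Toy development set: video id -> (lat, long), plus precomputed match scores.
dev_locations = {"v1": (-22.90, -43.20), "v2": (48.85, 2.35)}
scores = {"v1": 0.82, "v2": 0.41}
print(geocode_by_top_match(scores, dev_locations))  # (-22.9, -43.2)
```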
  5. Visual features (HMP): extraction
     • Histograms of Motion Patterns
     • Keyframes: not used
     • An algorithm to compare video sequences in three steps: (1) partial decoding; (2) feature extraction; (3) signature generation.
     • “Comparison of video sequences with histograms of motion patterns”, J. Almeida et al., ICIP 2011. (An illustrative sketch follows below.)
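The real HMP operates on partially decoded, compressed-domain data, which is hard to reproduce in a few lines; the sketch below is only a pixel-domain analogue of the idea, binarized 2x2 frame-difference blocks histogrammed into 16 motion patterns, and is not the authors' algorithm.

```python
import numpy as np

def motion_pattern_histogram(frames: list, thresh: float = 10.0) -> np.ndarray:
    """Quantize 2x2 blocks of each frame difference into one of 16 binary
    motion patterns and accumulate an L1-normalized histogram (the clip's
    signature), loosely mirroring HMP's decode/extract/signature steps."""
    hist = np.zeros(16)
    for prev, curr in zip(frames, frames[1:]):
        diff = np.abs(curr.astype(np.float64) - prev.astype(np.float64))
        h, w = diff.shape
        for y in range(0, h - 1, 2):
            for x in range(0, w - 1, 2):
                b = (diff[y:y + 2, x:x + 2].ravel() > thresh).astype(int)
                hist[b[0] * 8 + b[1] * 4 + b[2] * 2 + b[3]] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist

# Synthetic 8x8 grayscale frames stand in for a decoded video.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, (8, 8)) for _ in range(3)]
print(motion_pattern_histogram(frames).round(3))
```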
  6. Visual features (HMP): overview. [Almeida et al., Comparison of video sequences with histograms of motion patterns. ICIP 2011]
  7. HMP: comparing videos
     • Histograms can be compared with any vectorial distance function, such as Manhattan (L1) or Euclidean (L2).
     • Video sequences are compared with the histogram intersection, defined as $d(H_{v_1}, H_{v_2}) = \sum_i \min(H_i^{v_1}, H_i^{v_2})$, where $H^{v_i}$ is the histogram extracted from video $v_i$.
     • Output range [0, 1]: 0 = histograms not similar; 1 = identical histograms. (A sketch of this measure follows below.)
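The slide's intersection measure translates directly to code; this small sketch assumes the two histograms are already L1-normalized, as required for the [0, 1] output range.

```python
import numpy as np

def histogram_intersection(h1: np.ndarray, h2: np.ndarray) -> float:
    """d(H_v1, H_v2) = sum_i min(H_i^v1, H_i^v2). For L1-normalized
    histograms the result lies in [0, 1]: 1 = identical, 0 = disjoint."""
    return float(np.minimum(h1, h2).sum())

h1 = np.array([0.5, 0.3, 0.2])   # already L1-normalized
h2 = np.array([0.4, 0.4, 0.2])
print(histogram_intersection(h1, h2))  # min-wise sum: 0.4 + 0.3 + 0.2 = 0.9
```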
  8. Visual features: Bag-of-Scenes (BoS). Instead of a dictionary of local descriptions, BoS builds a dictionary of scenes, and each video is represented by a feature vector over that dictionary. [Penatti et al., A Visual Approach for Video Geocoding using Bag-of-Scenes. ACM ICMR 2012]
  9. Creating the dictionary: scenes → feature vectors → visual-word selection → dictionary of scenes. Using the dictionary: video frames → feature vectors → assignment and pooling → video feature vector (bag-of-scenes). A minimal sketch of the assignment and pooling steps follows below.
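The BoS paper considers several assignment and pooling variants (e.g., soft assignment, max pooling); the sketch below assumes the simplest combination, hard assignment plus average pooling, so it illustrates the structure of the step rather than the exact configuration used in the runs.

```python
import numpy as np

def bag_of_scenes(frame_feats: np.ndarray, dictionary: np.ndarray) -> np.ndarray:
    """Hard-assign each frame feature to its nearest scene prototype, then
    average-pool the assignments into one video-level histogram.
    frame_feats: (n_frames, dim); dictionary: (n_scenes, dim)."""
    dists = np.linalg.norm(frame_feats[:, None, :] - dictionary[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)                        # assignment
    hist = np.bincount(nearest, minlength=len(dictionary)).astype(float)
    return hist / max(len(frame_feats), 1)                # pooling (average)

# Toy example: 5 frame descriptors over a 3-scene dictionary.
rng = np.random.default_rng(1)
video_vector = bag_of_scenes(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)))
print(video_vector, video_vector.sum())  # components sum to 1.0
```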
  10. Data Fusion – Rank Aggregation
     • In multimedia applications, fusing information from different modalities is essential for high retrieval effectiveness.
     • Rank aggregation:
       – An unsupervised approach to data fusion.
       – Combines different features using a multiplication approach inspired by Naive Bayes classifiers (assuming conditional independence among features).
  11. Data Fusion – Rank Aggregation. [Diagram: ranked lists produced by one textual feature and two visual features are merged into a single combined ranked list.]
  12. Data Fusion – Rank Aggregation. Let q be a query video, d a dataset video, and $s_k(q, d)$ the similarity score between them under the k-th feature. Given the set of similarity functions defined by the different features, the aggregated score is computed by multiplying the individual scores: $s(q, d) = \prod_{k=1}^{m} s_k(q, d)$. (A sketch follows below.)
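A minimal sketch of the multiplication approach, assuming each feature already yields a similarity score in [0, 1] for every candidate; note that a candidate missing from one list gets score 0 here, which annihilates the product, and the slide does not specify how the authors handle that case.

```python
from math import prod

def aggregate_scores(per_feature: list) -> dict:
    """Multiplicative rank aggregation: the fused score of each candidate
    video is the product of its per-feature similarity scores."""
    candidates = set().union(*per_feature)
    return {c: prod(f.get(c, 0.0) for f in per_feature) for c in candidates}

# Fusing one textual and one visual score list (toy scores in [0, 1]).
textual = {"v1": 0.9, "v2": 0.4}
visual = {"v1": 0.6, "v2": 0.8}
fused = aggregate_scores([textual, visual])
print(sorted(fused.items(), key=lambda kv: -kv[1]))  # [('v1', 0.54), ('v2', 0.32)]
```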
  13. Runs summary

     Run    Description                                 Descriptors used
     Run 1  Combine 3 textual                           Okapi_all + Okapi_desc + Dice_desc
     Run 2  Combine 2 textual & 2 visual                Okapi_all + Okapi_desc + HMP + BoS_CEDD5000
     Run 3  Single visual: HMP (last year's visual)     HMP
     Run 4  Combine 3 visual                            HMP + BoS5000 + BoS500
     Run 5  Textual: Flickr photo tags as geo-profile   Okapi on keywords
  14. Results for the test set

     Radius (km)  Run 1   Run 2   Run 3   Run 4   Run 5
     1            21.40%  22.29%  15.81%  15.93%   9.28%
     10           30.68%  31.25%  16.07%  16.09%  19.44%
     100          35.39%  36.42%  16.62%  17.07%  24.13%
     200          37.37%  38.40%  17.58%  17.86%  25.85%
     500          41.77%  43.35%  19.68%  19.97%  29.29%
     1,000        45.38%  47.68%  24.77%  25.47%  33.91%
     2,000        53.32%  56.03%  33.48%  33.31%  46.05%
     5,000        62.29%  66.91%  45.34%  45.34%  65.73%
     10,000       85.27%  87.95%  81.95%  81.73%  87.69%
     15,000       95.89%  96.80%  95.79%  95.70%  96.17%

     The combination of classical text vector-space results (run 1) is roughly twice as effective as visual cues alone (run 4).
  15. Results for the test set (same table as above). Run 5 (photo metadata used as a geo-profile) performs worse than Run 4 (visual information only) at the 1 km radius. At all other radii, however, Run 5 outperforms Runs 3 and 4.
  16. Results for the test set (same table as above). [Plot: 99% confidence intervals of the geocoding distance (km) for runs 1 and 2.] The combination of different textual and visual descriptors (run 2) leads to statistically significant improvements (confidence ≥ 0.99) over using textual clues alone (run 1).
  17. Results for the test set, with 2011's result for HMP

     Radius (km)  Run 1   Run 2   Run 3   Run 4   Run 5   HMP (2011)
     1            21.40%  22.29%  15.81%  15.93%   9.28%   0.21%
     10           30.68%  31.25%  16.07%  16.09%  19.44%   1.12%
     100          35.39%  36.42%  16.62%  17.07%  24.13%   2.71%
     200          37.37%  38.40%  17.58%  17.86%  25.85%   3.33%
     500          41.77%  43.35%  19.68%  19.97%  29.29%   6.08%
     1,000        45.38%  47.68%  24.77%  25.47%  33.91%  12.16%
     2,000        53.32%  56.03%  33.48%  33.31%  46.05%  22.11%
     5,000        62.29%  66.91%  45.34%  45.34%  65.73%  37.78%
     10,000       85.27%  87.95%  81.95%  81.73%  87.69%  79.45%

     Run 3, using only HMP (our last year's approach), performs much better on this year's data set. Why? Perhaps the larger development set (+5,000 videos in 2012) yields a richer geo-profile.
  18. Conclusion
     • Textual features: Okapi & Dice
     • Visual features: HMP & BoS
       – HMP results: better in 2012 than in 2011. Is it due to the bigger development set?
     • Combined textual (video metadata) and visual features via ranked lists
     • Promising results: combining modalities yields better results than any single modality
     • Future improvements:
       – other strategies for combining different modalities
       – other information sources to filter out noisy data from ranked lists (e.g., GeoNames and Wikipedia)
  19. Acknowledgements & contacts
     • RECOD Lab @ Institute of Computing, UNICAMP (University of Campinas)
     • VoD Lab @ UFMG (Universidade Federal de Minas Gerais)
     • Organizers of the Placing Task and MediaEval 2012
     • Brazilian funding agencies: CAPES, FAPESP, CNPq
     Contact email: {lintzyli, jurandy.almeida, dcarlos, penatti, rtorres}@ic.unicamp.br
