1. A Multimodal Approach for Video Geocoding
(UNICAMP at Placing Task MediaEval 2012)
Lin Tzy Li, Jurandy Almeida, Daniel Carlos Guimarães Pedronette,
Otávio A. B. Penatti, and Ricardo da S. Torres
Institute of Computing - University of Campinas (UNICAMP)
Brazil
3. Textual features
• Similarity functions: Okapi & Dice
• Video metadata (run 1)
– Title + description + keywords (Okapi_all)
– Only description: Okapi_desc & Dice_desc
– Combined result in run 1:
• Okapi_all + Okapi_desc + Dice_desc
• Photo tags (run 5)
– Okapi function
– Matching: keywords (test video) × tags (3,185,258 Flickr photos)
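As an illustration of the Dice function named above, a minimal sketch over token sets (the whitespace tokenization and this exact formulation are assumptions; the slides only name the function):

```python
def dice_similarity(text_a: str, text_b: str) -> float:
    """Dice coefficient between the token sets of two texts.

    Range [0, 1]: 0 = no shared tokens, 1 = identical token sets.
    """
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    shared = tokens_a & tokens_b
    return 2.0 * len(shared) / (len(tokens_a) + len(tokens_b))

# e.g., a test video's keywords against a Flickr photo's tags
score = dice_similarity("beach sunset rio de janeiro", "sunset at a rio beach")
```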
4. Geocoding Visual Content
[Diagram] Test video -> video feature extraction -> video similarity
computation against the development set (15,563 geocoded videos with
lat/long, tags, title, description, etc.) -> ranked list of similar
videos V1 ... Vk.
The location (lat/long) of the most similar video is used as the
candidate for the test video, together with its match score.
Visual features: Bag of Scenes (photos) & Histograms of Motion Patterns
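A compact sketch of this pipeline (the field names 'features', 'lat', 'lon' and the function signature are hypothetical; any textual or visual similarity function from the other slides can be plugged in):

```python
def geocode(test_video, dev_set, similarity):
    """Assign the lat/long of the most similar development-set video.

    test_video / dev_set items: dicts with a 'features' entry; dev_set
    items additionally carry known 'lat' and 'lon' values.
    similarity: a function (features_a, features_b) -> score in [0, 1].
    Returns the candidate (lat, lon), the ranked list, and the match score.
    """
    ranked = sorted(dev_set,
                    key=lambda v: similarity(test_video['features'], v['features']),
                    reverse=True)               # V1 ... Vk, most similar first
    best = ranked[0]
    score = similarity(test_video['features'], best['features'])
    return (best['lat'], best['lon']), ranked, score
```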
5. Visual Features (HMP): Extraction
• Histograms of Motion Patterns
• Keyframes: not used
• An algorithm to compare video sequences in three steps:
(1) partial decoding;
(2) feature extraction;
(3) signature generation.
[Almeida et al., Comparison of video sequences with histograms of motion patterns. ICIP 2011]
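The three steps could be mimicked by the toy sketch below. It only illustrates the histogram-signature idea: frame differencing stands in for step (1), since the actual ICIP 2011 algorithm extracts motion information from the partially decoded compressed stream, and the block size, threshold, and pattern coding here are all assumptions:

```python
import numpy as np

def hmp_signature(frames):
    """Toy motion-pattern signature for a video.

    frames: iterable of equally sized grayscale frames (2-D uint8 arrays).
    Each 2x2 block of the thresholded inter-frame difference becomes a
    4-bit pattern; the signature is the normalized histogram of patterns.
    NOTE: a simplified stand-in, not the exact HMP extraction of
    Almeida et al. (ICIP 2011), which works on the compressed stream.
    """
    hist = np.zeros(16, dtype=np.float64)     # 2x2 binary block -> 16 patterns
    prev = None
    for frame in frames:
        f = frame.astype(np.int16)
        if prev is not None:
            diff = np.abs(f - prev) > 10      # "moving" pixels (threshold assumed)
            h, w = diff.shape
            blocks = diff[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
            codes = (blocks[:, 0, :, 0] * 8 + blocks[:, 0, :, 1] * 4 +
                     blocks[:, 1, :, 0] * 2 + blocks[:, 1, :, 1]).ravel()
            hist += np.bincount(codes, minlength=16)
        prev = f
    total = hist.sum()
    return hist / total if total else hist
```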
6. Visual Features (HMP): overview
[Almeida et al., Comparison of video sequences with histograms of motion patterns. ICIP 2011]
7. HMP: Comparing Videos
• Comparison of histograms can be performed by any vector distance
function, e.g., Manhattan (L1) or Euclidean (L2)
• Video sequences are compared with histogram intersection, defined as:
d(v_1, v_2) = \frac{\sum_i \min(H_i^{v_1}, H_i^{v_2})}{\sum_i H_i^{v_1}}
H^{v_i}: histogram extracted from video v_i. Output range [0, 1]:
0 = not similar histograms, 1 = identical histograms.
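A direct numpy rendering of the formula above (a minimal sketch; the guard against an empty histogram is an addition):

```python
import numpy as np

def histogram_intersection(h_v1, h_v2):
    """Normalized histogram intersection between two HMP signatures.

    Returns a value in [0, 1]: 0 = nothing in common, 1 = identical
    histograms (normalization by sum(h_v1) follows the slide's formula).
    """
    h_v1 = np.asarray(h_v1, dtype=np.float64)
    h_v2 = np.asarray(h_v2, dtype=np.float64)
    denom = h_v1.sum()
    return float(np.minimum(h_v1, h_v2).sum() / denom) if denom else 0.0
```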
8. Visual Features: Bag-of-Scenes (BoS)
[Diagram: instead of a dictionary of local descriptions, BoS builds a
dictionary of scenes; each video becomes a feature vector over that
scene dictionary]
[Penatti et al., A Visual Approach for Video Geocoding using Bag-of-Scenes. ACM ICMR 2012]
9. Creating the dictionary
[Diagram] Creating the dictionary: scenes -> feature vectors -> visual
word selection -> dictionary of scenes.
Using the dictionary: video frames -> feature vectors -> assignment ->
pooling -> video feature vector (bag-of-scenes).
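A minimal numpy sketch of the "using the dictionary" path; hard assignment to the nearest codeword and sum pooling are assumptions, since the slide does not fix the assignment and pooling choices:

```python
import numpy as np

def bag_of_scenes(frame_features, dictionary):
    """Bag-of-Scenes vector for one video.

    frame_features: (n_frames, d) array, one descriptor per video frame.
    dictionary: (n_scenes, d) array of scene codewords (e.g., built from
    photo descriptors, as in Penatti et al., ICMR 2012).
    """
    # squared Euclidean distance between every frame and every codeword
    dists = ((frame_features[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)                      # hard assignment
    bos = np.bincount(nearest, minlength=len(dictionary)).astype(np.float64)
    total = bos.sum()
    return bos / total if total else bos                # sum pooling + L1 norm
```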
10. Data Fusion – Rank Aggregation
• In multimedia applications, fusing information from different
modalities is essential for achieving high effectiveness
• Rank Aggregation:
– Unsupervised approach for data fusion
– Combines different features using a multiplication approach
inspired by Naive Bayes classifiers (assuming conditional
independence among features)
11. Data Fusion – Rank Aggregation
[Diagram: ranked lists produced by one textual feature and two visual
features are merged into a single combined ranked list]
12. Data Fusion – Rank Aggregation
Let q denote the query video, v a dataset video, and s_k(q, v) the
similarity score between them under feature k.
Given the set of similarity functions {s_1, ..., s_m} defined by the
different features, the new aggregated score is computed by:
s(q, v) = \prod_{k=1}^{m} s_k(q, v)
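A sketch of this multiplicative aggregation (the product form follows the Naive-Bayes-inspired description on slide 10; handling of missing scores and any normalization are left out):

```python
def aggregated_score(query, video, similarity_functions):
    """s(q, v) = prod_k s_k(q, v): multiply the per-feature similarities.

    similarity_functions: the set {s_1, ..., s_m}, e.g., Okapi, Dice,
    HMP histogram intersection, BoS similarity.
    """
    score = 1.0
    for s_k in similarity_functions:
        score *= s_k(query, video)
    return score

def combined_ranked_list(query, dataset, similarity_functions):
    """Rank all dataset videos by aggregated score, best first."""
    return sorted(dataset,
                  key=lambda v: aggregated_score(query, v, similarity_functions),
                  reverse=True)
```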
13. Runs Summary
Run    Description                                 Descriptors used
Run 1  Combine 3 textual                           Okapi_all + Okapi_desc + Dice_desc
Run 2  Combine 2 textual & 2 visual                Okapi_all + Okapi_desc + HMP + BoS_CEDD5000
Run 3  Single visual: HMP                          HMP (last year's visual descriptor)
Run 4  Combine 3 visual                            HMP + BoS5000 + BoS500
Run 5  Textual: Flickr photo tags as geo-profile   Okapi on keywords
14. Results for Test Set
Radius (km)  Run 1   Run 2   Run 3   Run 4   Run 5
1 21.40% 22.29% 15.81% 15.93% 9.28%
10 30.68% 31.25% 16.07% 16.09% 19.44%
100 35.39% 36.42% 16.62% 17.07% 24.13%
200 37.37% 38.40% 17.58% 17.86% 25.85%
500 41.77% 43.35% 19.68% 19.97% 29.29%
1,000 45.38% 47.68% 24.77% 25.47% 33.91%
2,000 53.32% 56.03% 33.48% 33.31% 46.05%
5,000 62.29% 66.91% 45.34% 45.34% 65.73%
10,000 85.27% 87.95% 81.95% 81.73% 87.69%
15,000 95.89% 96.80% 95.79% 95.70% 96.17%
The combined classical textual vector-space results (run 1) are about
twice as good as visual cues alone (run 4).
15. Results for Test Set
(results table repeated from slide 14)
Run 5 (Flickr photo metadata used as a geo-profile) performs worse than
run 4 (visual information only) at 1 km precision.
For all other radii, however, run 5 outperforms runs 3 and 4.
16. Results for Test Set
(results table repeated from slide 14)
[Chart: 99% confidence intervals for runs 1 and 2; y-axis: Distance (km), from 3,500 to 4,500]
The combination of different textual and visual descriptors (run 2)
leads to statistically significant improvements (confidence >= 0.99)
over using only textual cues (run 1).
17. Results for Test Set
Radius (km)  Run 1   Run 2   Run 3   Run 4   Run 5   HMP (2011)
1            21.40%  22.29%  15.81%  15.93%   9.28%   0.21%
10           30.68%  31.25%  16.07%  16.09%  19.44%   1.12%
100          35.39%  36.42%  16.62%  17.07%  24.13%   2.71%
200          37.37%  38.40%  17.58%  17.86%  25.85%   3.33%
500          41.77%  43.35%  19.68%  19.97%  29.29%   6.08%
1,000        45.38%  47.68%  24.77%  25.47%  33.91%  12.16%
2,000        53.32%  56.03%  33.48%  33.31%  46.05%  22.11%
5,000        62.29%  66.91%  45.34%  45.34%  65.73%  37.78%
10,000       85.27%  87.95%  81.95%  81.73%  87.69%  79.45%
Run 3, which uses only HMP (our approach from last year), performs much
better on this year's data set.
Why? The larger development set (+5,000 videos in 2012) may yield a richer geo-profile.
18. Conclusion
• Textual features: Okapi & Dice
• Visual features: HMP & BoS
– HMP results: better in 2012 than 2011
– Is it due to the bigger development set?
• Combined textual information (video metadata) and visual
features through ranked lists
• Promising results: combining modalities yields better results than
any single modality
• Future improvements:
– other strategies for combining different modalities
– other information sources to filter out noisy data from ranked
lists (e.g., GeoNames and Wikipedia)
19. Acknowledgements & contacts
• RECOD Lab @ Institute of Computing, UNICAMP
(University of Campinas)
• VoD Lab @ UFMG
(Universidade Federal de Minas Gerais)
• Organizers of Placing Task and MediaEval 2012
• Brazilian funding agencies
CAPES, FAPESP, CNPq
Contact email:
{lintzyli, jurandy.almeida, dcarlos, penatti, rtorres}@ic.unicamp.br