1. A Multimodal Approach for Video Geocoding
(UNICAMP at Placing Task MediaEval 2012)
Lin Tzy Li, Jurandy Almeida, Daniel Carlos Guimarães Pedronette,
Otávio A. B. Penatti, and Ricardo da S. Torres
Institute of Computing - University of Campinas (UNICAMP)
Brazil
3. Textual features
• Similarity functions: Okapi & Dice
• Video metadata (run 1)
– Title + description + keywords (Okapi_all)
– Only description: Okapi_desc & Dice_desc
– Combined result in run 1:
• Okapi_all + Okapi_desc + Dice_desc
• Photo tags (run 5)
– Okapi function
– Matching: keywords (test video) × tags (3,185,258 Flickr photos)
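As an illustration of the Dice function named above, a minimal sketch over token sets (the whitespace tokenization and this exact formulation are assumptions; the slides only name the function):

```python
def dice_similarity(text_a: str, text_b: str) -> float:
    """Dice coefficient between the token sets of two texts.

    Range [0, 1]: 0 = no shared tokens, 1 = identical token sets.
    """
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    shared = tokens_a & tokens_b
    return 2.0 * len(shared) / (len(tokens_a) + len(tokens_b))

# e.g., a test video's keywords against a Flickr photo's tags
score = dice_similarity("beach sunset rio de janeiro", "sunset at a rio beach")
```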
4. Geocoding Visual Content
[Diagram] Test video -> video feature extraction -> video similarity
computation against the development set (15,563 geocoded videos with
lat/long, tags, title, description, etc.) -> ranked list of similar
videos V1 ... Vk.
The location (lat/long) of the most similar video is used as the
candidate for the test video, together with its match score.
Visual features: Bag of Scenes (photos) & Histograms of Motion Patterns
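A compact sketch of this pipeline (the field names 'features', 'lat', 'lon' and the function signature are hypothetical; any textual or visual similarity function from the other slides can be plugged in):

```python
def geocode(test_video, dev_set, similarity):
    """Assign the lat/long of the most similar development-set video.

    test_video / dev_set items: dicts with a 'features' entry; dev_set
    items additionally carry known 'lat' and 'lon' values.
    similarity: a function (features_a, features_b) -> score in [0, 1].
    Returns the candidate (lat, lon), the ranked list, and the match score.
    """
    ranked = sorted(dev_set,
                    key=lambda v: similarity(test_video['features'], v['features']),
                    reverse=True)               # V1 ... Vk, most similar first
    best = ranked[0]
    score = similarity(test_video['features'], best['features'])
    return (best['lat'], best['lon']), ranked, score
```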
5. Visual Features (HMP): Extraction
• Histograms of Motion Patterns
• Keyframes: not used
• An algorithm to compare video sequences in three steps:
(1) partial decoding;
(2) feature extraction;
(3) signature generation.
[Almeida et al., Comparison of video sequences with histograms of motion patterns. ICIP 2011]
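The three steps could be mimicked by the toy sketch below. It only illustrates the histogram-signature idea: frame differencing stands in for step (1), since the actual ICIP 2011 algorithm extracts motion information from the partially decoded compressed stream, and the block size, threshold, and pattern coding here are all assumptions:

```python
import numpy as np

def hmp_signature(frames):
    """Toy motion-pattern signature for a video.

    frames: iterable of equally sized grayscale frames (2-D uint8 arrays).
    Each 2x2 block of the thresholded inter-frame difference becomes a
    4-bit pattern; the signature is the normalized histogram of patterns.
    NOTE: a simplified stand-in, not the exact HMP extraction of
    Almeida et al. (ICIP 2011), which works on the compressed stream.
    """
    hist = np.zeros(16, dtype=np.float64)     # 2x2 binary block -> 16 patterns
    prev = None
    for frame in frames:
        f = frame.astype(np.int16)
        if prev is not None:
            diff = np.abs(f - prev) > 10      # "moving" pixels (threshold assumed)
            h, w = diff.shape
            blocks = diff[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
            codes = (blocks[:, 0, :, 0] * 8 + blocks[:, 0, :, 1] * 4 +
                     blocks[:, 1, :, 0] * 2 + blocks[:, 1, :, 1]).ravel()
            hist += np.bincount(codes, minlength=16)
        prev = f
    total = hist.sum()
    return hist / total if total else hist
```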
6. Visual Features (HMP): overview
[Almeida et al., Comparison of video sequences with histograms of motion patterns. ICIP 2011]
7. HMP: Comparing Videos
• Comparison of histograms can be performed by any vector distance
function, e.g., Manhattan (L1) or Euclidean (L2)
• Video sequences are compared with histogram intersection, defined as:
d(v_1, v_2) = \frac{\sum_i \min(H_i^{v_1}, H_i^{v_2})}{\sum_i H_i^{v_1}}
H^{v_i}: histogram extracted from video v_i. Output range [0, 1]:
0 = not similar histograms, 1 = identical histograms.
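A direct numpy rendering of the formula above (a minimal sketch; the guard against an empty histogram is an addition):

```python
import numpy as np

def histogram_intersection(h_v1, h_v2):
    """Normalized histogram intersection between two HMP signatures.

    Returns a value in [0, 1]: 0 = nothing in common, 1 = identical
    histograms (normalization by sum(h_v1) follows the slide's formula).
    """
    h_v1 = np.asarray(h_v1, dtype=np.float64)
    h_v2 = np.asarray(h_v2, dtype=np.float64)
    denom = h_v1.sum()
    return float(np.minimum(h_v1, h_v2).sum() / denom) if denom else 0.0
```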
8. Visual Features: Bag-of-Scenes (BoS)
[Diagram: instead of a dictionary of local descriptions, BoS builds a
dictionary of scenes; each video becomes a feature vector over that
scene dictionary]
[Penatti et al., A Visual Approach for Video Geocoding using Bag-of-Scenes. ACM ICMR 2012]
9. Creating the dictionary
[Diagram] Creating the dictionary: scenes -> feature vectors -> visual
word selection -> dictionary of scenes.
Using the dictionary: video frames -> feature vectors -> assignment ->
pooling -> video feature vector (bag-of-scenes).
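A minimal numpy sketch of the "using the dictionary" path; hard assignment to the nearest codeword and sum pooling are assumptions, since the slide does not fix the assignment and pooling choices:

```python
import numpy as np

def bag_of_scenes(frame_features, dictionary):
    """Bag-of-Scenes vector for one video.

    frame_features: (n_frames, d) array, one descriptor per video frame.
    dictionary: (n_scenes, d) array of scene codewords (e.g., built from
    photo descriptors, as in Penatti et al., ICMR 2012).
    """
    # squared Euclidean distance between every frame and every codeword
    dists = ((frame_features[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)                      # hard assignment
    bos = np.bincount(nearest, minlength=len(dictionary)).astype(np.float64)
    total = bos.sum()
    return bos / total if total else bos                # sum pooling + L1 norm
```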
10. Data Fusion – Rank Aggregation
• In multimedia applications, fusing information from different
modalities is essential for achieving high effectiveness
• Rank Aggregation:
– Unsupervised approach for data fusion
– Combines different features using a multiplication approach
inspired by Naive Bayes classifiers (assuming conditional
independence among features)
11. Data Fusion – Rank Aggregation
[Diagram: ranked lists produced by one textual feature and two visual
features are merged into a single combined ranked list]
12. Data Fusion – Rank Aggregation
Let q denote the query video, v a dataset video, and s_k(q, v) the
similarity score between them under feature k.
Given the set of similarity functions {s_1, ..., s_m} defined by the
different features, the new aggregated score is computed by:
s(q, v) = \prod_{k=1}^{m} s_k(q, v)
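A sketch of this multiplicative aggregation (the product form follows the Naive-Bayes-inspired description on slide 10; handling of missing scores and any normalization are left out):

```python
def aggregated_score(query, video, similarity_functions):
    """s(q, v) = prod_k s_k(q, v): multiply the per-feature similarities.

    similarity_functions: the set {s_1, ..., s_m}, e.g., Okapi, Dice,
    HMP histogram intersection, BoS similarity.
    """
    score = 1.0
    for s_k in similarity_functions:
        score *= s_k(query, video)
    return score

def combined_ranked_list(query, dataset, similarity_functions):
    """Rank all dataset videos by aggregated score, best first."""
    return sorted(dataset,
                  key=lambda v: aggregated_score(query, v, similarity_functions),
                  reverse=True)
```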
13. Runs Summary
Run    Description                                 Descriptors used
Run 1  Combine 3 textual                           Okapi_all + Okapi_desc + Dice_desc
Run 2  Combine 2 textual & 2 visual                Okapi_all + Okapi_desc + HMP + BoS_CEDD5000
Run 3  Single visual: HMP                          HMP (last year's visual descriptor)
Run 4  Combine 3 visual                            HMP + BoS5000 + BoS500
Run 5  Textual: Flickr photo tags as geo-profile   Okapi on keywords
14. Results for Test Set
Radius (km)  Run 1   Run 2   Run 3   Run 4   Run 5
1 21.40% 22.29% 15.81% 15.93% 9.28%
10 30.68% 31.25% 16.07% 16.09% 19.44%
100 35.39% 36.42% 16.62% 17.07% 24.13%
200 37.37% 38.40% 17.58% 17.86% 25.85%
500 41.77% 43.35% 19.68% 19.97% 29.29%
1,000 45.38% 47.68% 24.77% 25.47% 33.91%
2,000 53.32% 56.03% 33.48% 33.31% 46.05%
5,000 62.29% 66.91% 45.34% 45.34% 65.73%
10,000 85.27% 87.95% 81.95% 81.73% 87.69%
15,000 95.89% 96.80% 95.79% 95.70% 96.17%
The combined classical textual vector-space results (run 1) are about
twice as good as visual cues alone (run 4).
15. Results for Test Set
(results table repeated from slide 14)
Run 5 (Flickr photo metadata used as a geo-profile) performs worse than
run 4 (visual information only) at 1 km precision.
For all other radii, however, run 5 outperforms runs 3 and 4.
16. Results for Test Set
(results table repeated from slide 14)
[Chart: 99% confidence intervals for runs 1 and 2; y-axis: Distance (km), from 3,500 to 4,500]
The combination of different textual and visual descriptors (run 2)
leads to statistically significant improvements (confidence >= 0.99)
over using only textual cues (run 1).
17. Results for Test Set
Radius (km)  Run 1   Run 2   Run 3   Run 4   Run 5   HMP (2011)
1            21.40%  22.29%  15.81%  15.93%   9.28%   0.21%
10           30.68%  31.25%  16.07%  16.09%  19.44%   1.12%
100          35.39%  36.42%  16.62%  17.07%  24.13%   2.71%
200          37.37%  38.40%  17.58%  17.86%  25.85%   3.33%
500          41.77%  43.35%  19.68%  19.97%  29.29%   6.08%
1,000        45.38%  47.68%  24.77%  25.47%  33.91%  12.16%
2,000        53.32%  56.03%  33.48%  33.31%  46.05%  22.11%
5,000        62.29%  66.91%  45.34%  45.34%  65.73%  37.78%
10,000       85.27%  87.95%  81.95%  81.73%  87.69%  79.45%
Run 3, which uses only HMP (our approach from last year), performs much
better on this year's data set.
Why? The larger development set (+5,000 videos in 2012) may yield a richer geo-profile.
18. Conclusion
• Textual features: Okapi & Dice
• Visual features: HMP & BoS
– HMP results: better in 2012 than 2011
– Is it due to the bigger development set?
• Combined textual information (video metadata) and visual
features through ranked lists
• Promising results: combining modalities yields better results than
any single modality
• Future improvements:
– other strategies for combining different modalities
– other information sources to filter out noisy data from ranked
lists (e.g., GeoNames and Wikipedia)
19. Acknowledgements & contacts
• RECOD Lab @ Institute of Computing, UNICAMP
(University of Campinas)
• VoD Lab @ UFMG
(Universidade Federal de Minas Gerais)
• Organizers of Placing Task and MediaEval 2012
• Brazilian funding agencies
CAPES, FAPESP, CNPq
Contact email:
{lintzyli, jurandy.almeida, dcarlos, penatti, rtorres}@ic.unicamp.br