A Multimodal Approach for
           Video Geocoding
        (UNICAMP at Placing Task MediaEval 2012)
Lin Tzy Li, Jurandy Almeida, Daniel Carlos Guimarães Pedronette,
            Otávio A. B. Penatti, and Ricardo da S. Torres
   Institute of Computing - University of Campinas (UNICAMP)
                                 Brazil
Multimodal geocoding proposal
Textual features
• Similarity functions: Okapi & Dice (a Dice sketch follows below)
• Video metadata (run 1)
  – Title + description + keywords (Okapi_all)
  – Description only: Okapi_desc & Dice_desc
  – Combined result in run 1:
      • Okapi_all + Okapi_desc + Dice_desc
• Photo tags (run 5)
  – Okapi function: keywords of the test video matched against the tags of 3,185,258 Flickr photos
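To make the token matching concrete, here is a minimal Python sketch of the Dice coefficient over token sets. The tokenization (lowercase, whitespace split) is an assumption for illustration; the slides do not give the preprocessing or the Okapi BM25 parameters used in the runs.

# Minimal sketch: Dice similarity between two texts as token sets.
# Tokenization (lowercase, whitespace split) is an assumption, not the
# exact preprocessing used in the UNICAMP runs.
def dice(text_a: str, text_b: str) -> float:
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Example: test-video keywords vs. a Flickr photo's tags.
print(dice("eiffel tower paris night", "paris eiffel tower france"))  # 0.75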
Geocoding Visual Content

Pipeline: a test video goes through video feature extraction, and its similarity to every video in the development set (15,563 geocoded videos, each with lat/long, tags, title, description, etc.) is computed, producing a ranked list of similar videos. The location (lat/long) of the most similar video is used as the candidate location for the test video, together with its match score (a sketch follows below).

Visual features used:
• Bag of Scenes (built from photos)
• Histograms of Motion Patterns
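A minimal sketch of this 1-nearest-neighbor geocoding step in Python. The feature and similarity functions are placeholders (the deck uses HMP and BoS similarities), and the data layout is an assumption for illustration.

# Sketch of 1-NN geocoding: the test video receives the lat/long of its
# most similar development video, plus the match score.
# `similarity` is a placeholder for an HMP or BoS similarity in [0, 1].
def geocode(test_feature, dev_set, similarity):
    """dev_set: list of (feature, (lat, lon)) pairs for geocoded videos."""
    best_feature, best_latlon = max(
        dev_set, key=lambda item: similarity(test_feature, item[0]))
    return best_latlon, similarity(test_feature, best_feature)

# Usage: latlon, score = geocode(f_test, dev_videos, hist_intersection)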
Visual Features (HMP): Extracting

• Histograms of Motion Patterns
• Keyframes: not used
• Video sequences are compared by an algorithm with three steps (a simplified sketch follows below):
  (1) partial decoding;
  (2) feature extraction;
  (3) signature generation.

“Comparison of video sequences with histograms of motion patterns”, J. Almeida et al., ICIP 2011.
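As a rough illustration of the signature idea only (not the exact HMP algorithm of Almeida et al., which operates on motion vectors obtained by partially decoding the compressed stream), the sketch below quantizes per-block motion vectors into direction bins and accumulates a normalized histogram.

import math

# Rough illustration of a motion-pattern signature (NOT the exact HMP
# of Almeida et al.): quantize each motion vector into one of 8
# direction bins plus a "no motion" bin, accumulate over all frames,
# and L1-normalize.
def motion_histogram(frames, n_dirs=8):
    """frames: iterable of per-frame lists of motion vectors (dx, dy)."""
    hist = [0.0] * (n_dirs + 1)  # last bin counts zero-motion blocks
    for vectors in frames:
        for dx, dy in vectors:
            if dx == 0 and dy == 0:
                hist[n_dirs] += 1
            else:
                angle = math.atan2(dy, dx) % (2 * math.pi)
                hist[int(angle / (2 * math.pi) * n_dirs) % n_dirs] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist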
Visual Features (HMP): overview

[Figure: overview of the HMP extraction and comparison pipeline]

[Almeida et al., Comparison of video sequences with histograms of motion patterns. ICIP 2011]
HMP: Comparing Video

• Comparison of histograms can be performed by any vectorial distance function
   – e.g., Manhattan (L1) or Euclidean (L2)
• Video sequences are compared with the histogram intersection, defined as (a sketch follows below):

$d(v_1, v_2) = \dfrac{\sum_i \min(H_{v_1}^i, H_{v_2}^i)}{\sum_i H_{v_1}^i}$

where $H_{v_i}$ is the histogram extracted from video $V_i$.
Output range: [0, 1]; 0 = histograms not similar, 1 = identical histograms.
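A minimal sketch of this normalized histogram intersection in Python, matching the formula above (histograms assumed non-negative and of equal length):

# Normalized histogram intersection:
# d = sum_i min(H1[i], H2[i]) / sum_i H1[i]; output in [0, 1].
def hist_intersection(h1, h2):
    denom = sum(h1)
    if denom == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(h1, h2)) / denom

assert hist_intersection([2, 3, 5], [2, 3, 5]) == 1.0  # identical
assert hist_intersection([1, 0], [0, 1]) == 0.0        # no overlap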
Visual Features: Bag-of-Scenes (BoS)

[Figure: by analogy with a dictionary of local descriptions in bag-of-visual-words, BoS uses a dictionary of scenes; each video is then represented by a feature vector over that dictionary]

[Penatti et al., A Visual Approach for Video Geocoding using Bag-of-Scenes. ACM ICMR 2012]
Creating the dictionary

[Figure: feature vectors are extracted from scene photos; visual words selection over these vectors yields the dictionary of scenes, later used for assignment and pooling. A k-means sketch follows below.]
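The slides do not name the visual-word selection method; k-means clustering over the scene feature vectors is a common choice, so the sketch below assumes it (scikit-learn's KMeans; descriptor dimensionality and word count are illustrative).

import numpy as np
from sklearn.cluster import KMeans

# Assumption: visual words are selected by k-means over scene feature
# vectors (the run name BoS_CEDD5000 suggests CEDD descriptors and a
# 5,000-word dictionary, but the selection method itself is not stated).
def create_scene_dictionary(scene_features: np.ndarray, n_words: int):
    """scene_features: (n_scenes, dim) array of scene descriptors."""
    kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    kmeans.fit(scene_features)
    return kmeans.cluster_centers_  # (n_words, dim) dictionary of scenes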
Using the dictionary

[Figure: frames are sampled from the video and a feature vector is extracted per frame; each vector is assigned to the dictionary of scenes, and pooling yields the video feature vector (bag-of-scenes). A sketch follows below.]
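A minimal sketch of hard assignment with histogram (sum) pooling, in Python with numpy; the actual assignment and pooling strategies of the BoS runs are not detailed in the slides, so these are assumptions.

import numpy as np

# Assumption: hard-assign each frame descriptor to its nearest scene,
# then pool the assignments into an L1-normalized histogram.
def bag_of_scenes(frame_features: np.ndarray, dictionary: np.ndarray):
    """frame_features: (n_frames, dim); dictionary: (n_words, dim).
    Returns an (n_words,) bag-of-scenes vector."""
    dists = np.linalg.norm(
        frame_features[:, None, :] - dictionary[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)            # nearest scene per frame
    hist = np.bincount(nearest, minlength=len(dictionary)).astype(float)
    total = hist.sum()
    return hist / total if total else hist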
Data Fusion – Rank Aggregation
• In multimedia applications, fusing information from different modalities is essential for achieving high retrieval effectiveness
• Rank Aggregation:
  – an unsupervised approach to data fusion
  – different features are combined by a multiplication approach inspired by Naive Bayes classifiers (assuming conditional independence among features)
Data Fusion – Rank Aggregation

[Figure: the ranked lists produced by one textual feature and two visual features are merged into a single combined ranked list]
Data Fusion – Rank Aggregation

Let $q$ denote the query video, $v$ a dataset video, and $s_k(q, v)$ the similarity score between them under feature $k$.

Set of similarity functions defined by the different features:

$S = \{ s_1, s_2, \ldots, s_m \}$

New aggregated score, computed by multiplying the individual scores (the Naive-Bayes-inspired multiplication approach above; a sketch follows below):

$s(q, v) = \prod_{k=1}^{m} s_k(q, v)$
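A minimal sketch of this multiplicative aggregation in Python; per-feature scores are assumed to lie in [0, 1] (as the histogram intersection does), since the slides do not give normalization details.

# Multiplicative rank aggregation: a dataset video's aggregated score is
# the product of its per-feature similarity scores to the query.
def aggregate(per_feature_scores):
    """per_feature_scores: dict video_id -> [s_1, ..., s_m].
    Returns video ids ranked by descending product of scores."""
    def product(scores):
        p = 1.0
        for s in scores:
            p *= s
        return p
    return sorted(per_feature_scores,
                  key=lambda vid: product(per_feature_scores[vid]),
                  reverse=True)

# Usage: aggregate({"v1": [0.9, 0.4], "v2": [0.7, 0.7]}) -> ["v2", "v1"]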
Runs Summary

Run     Description                                 Descriptor(s) used
Run 1   Combine 3 textual                           Okapi_all + Okapi_desc + Dice_desc
Run 2   Combine 2 textual & 2 visual                Okapi_all + Okapi_desc + HMP + BoS_CEDD5000
Run 3   Single visual: HMP                          HMP (last year's visual approach)
Run 4   Combine 3 visual                            HMP + BoS5000 + BoS500
Run 5   Textual: Flickr photo tags as geo-profile   Okapi on keywords
Results for Test Set

Radius (km)    Run 1     Run 2     Run 3     Run 4     Run 5
          1   21.40%    22.29%    15.81%    15.93%     9.28%
         10   30.68%    31.25%    16.07%    16.09%    19.44%
        100   35.39%    36.42%    16.62%    17.07%    24.13%
        200   37.37%    38.40%    17.58%    17.86%    25.85%
        500   41.77%    43.35%    19.68%    19.97%    29.29%
      1,000   45.38%    47.68%    24.77%    25.47%    33.91%
      2,000   53.32%    56.03%    33.48%    33.31%    46.05%
      5,000   62.29%    66.91%    45.34%    45.34%    65.73%
     10,000   85.27%    87.95%    81.95%    81.73%    87.69%
     15,000   95.89%    96.80%    95.79%    95.70%    96.17%

The combination of classical text vector space models (run 1) performs
roughly twice as well as visual cues alone (run 4).
Results for Test Set

Radius (km)    Run 1     Run 2     Run 3     Run 4     Run 5
          1   21.40%    22.29%    15.81%    15.93%     9.28%
         10   30.68%    31.25%    16.07%    16.09%    19.44%
        100   35.39%    36.42%    16.62%    17.07%    24.13%
        200   37.37%    38.40%    17.58%    17.86%    25.85%
        500   41.77%    43.35%    19.68%    19.97%    29.29%
      1,000   45.38%    47.68%    24.77%    25.47%    33.91%
      2,000   53.32%    56.03%    33.48%    33.31%    46.05%
      5,000   62.29%    66.91%    45.34%    45.34%    65.73%
     10,000   85.27%    87.95%    81.95%    81.73%    87.69%
     15,000   95.89%    96.80%    95.79%    95.70%    96.17%

Run 5 (photo metadata used as a geo-profile) performs worse than run 4
(visual information only) at 1 km precision; at all other radii, however,
run 5 outperforms runs 3 and 4.
Results for Test Set

Radius (km)    Run 1     Run 2     Run 3     Run 4     Run 5
          1   21.40%    22.29%    15.81%    15.93%     9.28%
         10   30.68%    31.25%    16.07%    16.09%    19.44%
        100   35.39%    36.42%    16.62%    17.07%    24.13%
        200   37.37%    38.40%    17.58%    17.86%    25.85%
        500   41.77%    43.35%    19.68%    19.97%    29.29%
      1,000   45.38%    47.68%    24.77%    25.47%    33.91%
      2,000   53.32%    56.03%    33.48%    33.31%    46.05%
      5,000   62.29%    66.91%    45.34%    45.34%    65.73%
     10,000   85.27%    87.95%    81.95%    81.73%    87.69%
     15,000   95.89%    96.80%    95.79%    95.70%    96.17%

[Chart: 99% confidence intervals of the geocoding distance (km) for
run 1 and run 2, on a y-axis from 3,500 to 4,500 km]

The combination of different textual and visual descriptors (run 2) leads
to statistically significant improvements (confidence >= 0.99) over using
textual clues alone (run 1).
Results for Test Set

Radius (km)    Run 1     Run 2     Run 3     Run 4     Run 5   2011 HMP
          1   21.40%    22.29%    15.81%    15.93%     9.28%      0.21%
         10   30.68%    31.25%    16.07%    16.09%    19.44%      1.12%
        100   35.39%    36.42%    16.62%    17.07%    24.13%      2.71%
        200   37.37%    38.40%    17.58%    17.86%    25.85%      3.33%
        500   41.77%    43.35%    19.68%    19.97%    29.29%      6.08%
      1,000   45.38%    47.68%    24.77%    25.47%    33.91%     12.16%
      2,000   53.32%    56.03%    33.48%    33.31%    46.05%     22.11%
      5,000   62.29%    66.91%    45.34%    45.34%    65.73%     37.78%
     10,000   85.27%    87.95%    81.95%    81.73%    87.69%     79.45%
     15,000   95.89%    96.80%    95.79%    95.70%    96.17%

Run 3, which uses only HMP (our approach from last year), performs much
better on this year's data set.
Why? The larger development set (about 5,000 more videos in 2012) may
provide a richer geo-profile.
Conclusion
• Textual features: Okapi & Dice
• Visual features: HMP & BoS
   – HMP results: better in 2012 than in 2011
   – Is this due to the bigger development set?
• Combined textual (video metadata) and visual features
   • via rank aggregation over ranked lists
   • Promising results: combining modalities yields better results than
     any single modality
• Future improvements:
   – other strategies for combining different modalities
   – other information sources to filter out noisy data from ranked
     lists (e.g., GeoNames and Wikipedia)
Acknowledgements & contacts
• RECOD Lab @ Institute of Computing, UNICAMP
  (University of Campinas)
• VoD Lab @ UFMG
  (Universidade Federal de Minas Gerais)
• Organizers of Placing Task and MediaEval 2012
• Brazilian funding agencies
  CAPES, FAPESP, CNPq
                                                          Contact email:
        {lintzyli, jurandy.almeida, dcarlos, penatti, rtorres}@ic.unicamp.br
