TUD MediaEval 2012 Tagging Task
Reporter: Martha A. Larson
Multimedia Information Retrieval Lab
Delft University of Technology
05-10-2012




       Delft
       University of
       Technology

       Challenge the future
Outline

•  TUD-MM: Multi-modality video categorization with one-vs-all classifiers
   •  Peng Xu, Yangyang Shi, Martha A. Larson

•  MediaEval 2012 Tagging Task: Prediction based on One Best
List and Confusion Networks
   •  Yangyang Shi, Martha A. Larson, Catholijn M. Jonker




                                         TUD MediaEval 2012 Tagging Task
                     Visual similarity measures for semantic video retrieval 	
   2
TUD-MM: Multi-modality video
categorization with one-vs-all
classifiers
Peng Xu, Yangyang Shi, Martha A. Larson
05-10-2012




Introduction
•  Features from different modalities
   •  Visual feature
        •  Visual Words based representation & Global video representation

   •  Text features
        •  ASR, Metadata

        •  Term-frequency, LDA

•  Classification and Fusion
   •  One-vs-all linear SVMs
   •  Reciprocal Rank Fusion
   •  Post-processing procedure to assign one category label for each video
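The post-processing step, as we read it, reduces per-category scores to a single label per video. A minimal sketch (function and score names are ours, not the authors'):

```python
# Hypothetical sketch: pick the top-scoring category for each video.
def assign_single_labels(fused_scores):
    """fused_scores: {video_id: {category: score}} -> {video_id: category}"""
    return {vid: max(scores, key=scores.get)
            for vid, scores in fused_scores.items()}

labels = assign_single_labels({"v1": {"news": 0.8, "sports": 0.3},
                               "v2": {"news": 0.1, "sports": 0.6}})
```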

Visual representations
 •  Visual words based video representation
    •  SIFT features are extracted from each key-frame
    •  Visual vocabulary is built by hierarchical k-means clustering
    •  Normalized term frequencies are computed over the entire video

 •  Global video representation
    •  Edit features
    •  Content features
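The visual-word representation can be sketched as follows; random vectors stand in for SIFT descriptors, and flat k-means stands in for the hierarchical variant used in the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for 128-D SIFT descriptors pooled from a video's key-frames.
descriptors = rng.normal(size=(200, 128))

# Flat k-means as a stand-in for hierarchical k-means vocabulary building.
vocab_size = 16
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0).fit(descriptors)

def bow_histogram(desc, kmeans):
    """Quantize descriptors against the vocabulary, then L1-normalize."""
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()

video_repr = bow_histogram(descriptors, kmeans)
```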




Classification and Fusion

    •  One-vs-all linear SVM
         •  C is determined by 5-fold cross-validation

    •  Reciprocal Rank Fusion (RRF)*

         •  K = 60 is used to balance the importance of the lower-ranked items
         •  The weights w(r) are determined by the cross-validation errors
         from each modality

    •  Post-processing procedure
* G. V. Cormack, C. L. A. Clarke, and S. Buettcher. Reciprocal rank fusion outperforms
Condorcet and individual rank learning methods. SIGIR '09, pages 758-759.
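RRF scores each item by a weighted sum of reciprocal ranks across the per-modality ranked lists. A minimal sketch with the slide's k = 60 default (equal weights unless given):

```python
def rrf(rankings, weights=None, k=60):
    """Weighted Reciprocal Rank Fusion: score(d) = sum_r w_r / (k + rank_r(d))."""
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = {}
    for w, ranking in zip(weights, rankings):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    # Return items ordered by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["a", "b", "c"], ["b", "c", "a"]])
```

Up-weighting a modality (here the first list by 3x) shifts the fused order toward its ranking.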

Result analysis
•  MAP of different runs

           Run_1    Run_2      Run_3       Run_4        Run_5       *Run_6       *Run_7

MAP        0.0061   0.3127     0.2279      0.3675       0.2157      0.0577       0.0047

      •  Run_1 to Run_5 are official runs
      •  Run_6 is the visual-only run without post-processing
      •  Run_7 is the visual-only run with global feature




Performance of visual features

[Bar chart: MAP of the visual runs — random baseline vs. visual-word (VW) vs. global features; all values fall below 0.025]




MediaEval 2012 Tagging Task:
Prediction based on One Best List and
Confusion Networks
Yangyang Shi, Martha A. Larson, Catholijn M. Jonker
05-10-2012




Models for One-best list and
   Confusion Networks


[Diagram: Support Vector Machine, Dynamic Bayesian Networks, and Conditional Random Fields, each applied to the ASR output]


One-best List SVM


[Pipeline: cut-off-3 vocabulary → TF-IDF weighting → linear-kernel multi-class SVM (C = 0.5)]
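The text pipeline can be sketched with off-the-shelf components; the toy transcripts and labels are illustrative, and `min_df` stands in for the cut-off-3 vocabulary pruning:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy ASR transcripts with genre labels (not the task corpus).
docs = ["goal match football team", "election vote government party",
        "football goal score team", "vote party election debate"]
labels = ["sports", "politics", "sports", "politics"]

# min_df stands in for the cut-off-3 vocabulary; C=0.5 as in the slides.
clf = make_pipeline(TfidfVectorizer(min_df=1), LinearSVC(C=0.5))
clf.fit(docs, labels)
pred = clf.predict(["team scored a goal in the match"])[0]
```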




One-best List DBN
[Diagram: DBN unrolled over three time slices, with layers E1–E3, T1–T3, and words W1–W3]
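One plausible reading of the DBN, and of why word order helps (see the discussion slide), is a genre variable conditioning a chain of words. A toy sketch with genre-conditional bigram models (data and smoothing are ours):

```python
import math
from collections import defaultdict

def train_bigrams(seqs):
    """Count word bigrams, with <s> marking sequence start."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for a, b in zip(["<s>"] + seq, seq):
            counts[a][b] += 1
    return counts

def log_prob(seq, counts, alpha=0.1, vocab=10):
    """Add-alpha smoothed log-likelihood of a word sequence."""
    lp = 0.0
    for a, b in zip(["<s>"] + seq, seq):
        total = sum(counts[a].values())
        lp += math.log((counts[a][b] + alpha) / (total + alpha * vocab))
    return lp

genres = {
    "sports": [["goal", "scored", "late"], ["team", "scored", "goal"]],
    "news":   [["late", "goal", "reported"], ["reported", "late", "news"]],
}
models = {g: train_bigrams(seqs) for g, seqs in genres.items()}
# Predict the genre whose word-order model best explains the transcript.
best = max(models, key=lambda g: log_prob(["team", "scored", "goal"], models[g]))
```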




Results on Only ASR Run
      Models                 MAP
      Run2-one-best SVM      0.23
      Run2-one-best DBN      0.25
      Run2-one-best CRF      0.10
      Run2-CN-CRF            0.09




Average Precision on Each Genre
[Bar chart: average precision per genre for DBN vs. SVM, ranging from 0 to 0.8]




Discussion and Future work
•  Discussion
    •  Visual-only methods can be improved in several ways
         •  Feature selection or dimensionality reduction methods can be applied.
         •  Genre-level video representation

    •  CRF failure
         •  A document is treated as an item rather than one word.
         •  The feature set is too large for training to converge.

    •  DBN outperforms SVM: the sequence-order information probably helps
    prediction

•  Potentials
    •  Generate clear and useful labels
Thank you!



