Television Linked To The Web



               Daniel Stein1, Evlampios Apostolidis2, Vasileios Mezaris2,
                      Nicolas de Abreu Pereira3, Jennifer Müller3,
                    Mathilde Sahuguet4, Benoit Huet4, Ivo Lašek5


  Enrichment of News Show Videos with
  Multimodal Semi-Automatic Analysis
1 Fraunhofer Institute IAIS, Schloss Birlinghoven, Germany
2 Information Technologies Institute, CERTH, Greece
3 rbb - Rundfunk Berlin-Brandenburg, 14482 Potsdam, Germany
4 Eurecom, Sophia Antpolis, France
5 Czech Technical University and University of Economics, Prague, Czech Republic
     NEM Summit, Istanbul,
                                         www.linkedtv.eu
        October 2012
Synopsis
                                                                www.linkedtv.eu




   Introduction: LinkedTV Project
   Use Cases
   Intelligent Video Analysis
   Results
   Conclusions & future plans




2                               Information Technologies Institute
                                Centre for Research and Technology Hellas
LinkedTV ― Television Linked To the Web
                                                                            www.linkedtv.eu


   Vision:                                  12 Excellent Partners
     hypervideo                          Fraunhofer      Eurecom
     ubiquitously online cloud of        STI GMBH        Condat
      Networked Audio-Visual Content      CERTH           BEELD EN GELUID
     decoupled from place, device or     UEP             Noterik
      source                              UMONS           U. ST GALLEN
                                          CWI             RBB
   Aim:
     provide interactive multimedia

      service for non-professional end-
      users
     Focus on television broadcast

      content as seed videos

   Web: http://www.linkedtv.eu


    3                                       Information Technologies Institute
                                            Centre for Research and Technology Hellas
LinkedTV Workflow
                                                                                 www.linkedtv.eu




    Overall Architecture

                                  Use Case Scenarios



     Intelligent Video Analysis


     Linking Hypervideo to Web Content
                                                        Contextualization
                                                        and Personalization
     Interface and Presentation Engine




4                                                Information Technologies Institute
                                                 Centre for Research and Technology Hellas
LinkedTV Workflow
                                                                                 www.linkedtv.eu




    Overall Architecture

                                  Use Case Scenarios



     Intelligent Video Analysis


     Linking Hypervideo to Web Content
                                                        Contextualization
                                                        and Personalization
     Interface and Presentation Engine




5                                                Information Technologies Institute
                                                 Centre for Research and Technology Hellas
Two Use Case Scenarios in LinkedTV
                                                                        www.linkedtv.eu


Scenario 1 (this talk):
Interactive News Show
 Professional news
                           Due to legal constraints: whitelist
                           Detailed scenario archetype description
    content produced by
    RBB                         News topic,
 Seed content: local           people,
    news show "rbb              locations,
    Aktuell"                    objects etc
                                                 Scenario 2
                                                 (not covered here):
                                                 Hyperlinked Documentary
                                                    Cultural content from
                                                     S&V (1700 hours of
                                                     cultural heritage AV-
                                                     content under CCL)
                                                    Seed content:
                                                     "Antique Roadshow"
                                                                                    6
6                                       Information Technologies Institute
                                        Centre for Research and Technology Hellas
Intelligent Video Analysis
                                                             www.linkedtv.eu




7                            Information Technologies Institute
                             Centre for Research and Technology Hellas
Segmentation
                                                                                                                           www.linkedtv.eu




   Shot segmentation technique                                    Spatio-temporal Segmentation
   [Tsamoura et. al., 2008]                                       [Mezaris et. al., 2004]
   News show video performance:                                   News show performance: Good
  “remarkably well”                                                False positives due to:
   Out of 269 shots detected:                                            Camera movement or zoom in/out (~ 55 %)
           2 had wrong starting points                                   Gradual transition between frames (~ 10 %)
                                                                          Erroneous motion vectors (~ 35 %)
           4 contained multiple shots


           11 were too short to evaluate
                                                                   Unwanted effect: false recognition of
          properly                                                  moving banners which do not yield
                                                                    additional information




V. Mezaris, I. Kompatsiaris, N. V. Boulgouris, and M. G. Strintzis, "Real-time compressed-domain spatiotemporal segmentation and ontologies
for video indexing and retrieval", IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 606-621, May 2004.

E. Tsamoura, V. Mezaris, I. Kompatsiaris, "Gradual transition detection using color coherence and other criteria in a video shot meta-
segmentation framework", IEEE International Conference on Image Processing, Workshop on Multimedia Information Retrieval (ICIP-MIR
2008), San Diego, CA, USA, October 2008, pp. 45-48.


      8                                                                           Information Technologies Institute
                                                                                  Centre for Research and Technology Hellas
Concept Detection
                                                                                                                              www.linkedtv.eu




       Method was described in [Moumtzidou et.
        al., 2011]
       346 concepts from TRECVID 2011 SIN task
       Overall performance:
          Correctly detected concepts > 64 %
          About 25 % of them are characterized as
           particularly useful mostly related to detecting
           persons (e.g., person, face, adult)
          Erroneous concepts vary between 22% -
           42% and in many cases achieve high scores
           (e.g., outdoor, amateur video)

                                                                                      Visit: http://mklab.iti.gr/eventdetection-linkedtv/



A. Moumtzidou, P. Sidiropoulos, S. Vrochidis, N. Gkalelis, S. Nikolopoulos, V. Mezaris, I. Kompatsiaris, I. Patras, "ITI-CERTH participation
to TRECVID 2011", Proc. TRECVID 2011 Workshop, December 2011, Gaithersburg, MD, USA.


    9                                                                               Information Technologies Institute
                                                                                    Centre for Research and Technology Hellas
Automatic Speech Recognition
                                                                                                                       www.linkedtv.eu


 Automatic speech recognition for German (using [Schneider08]):

        segment of one news show                WER            notes
        new airport                             36.2           outdoor, spontaneous
        soccer riot                             44.2           tavern, dialect, background noise
        various news I                          9.5

        murder case                             24.0

        boxing                                  50.6           dialect, very spontaneous
        various news II                         20.9

        rbb game                                39.1

        weather report                          46.7           spontaneous, casual

  main obstacles: local dialect, spontaneous speech, background noise

 Schneider, D., Schon, J., and Eickeler, S. (2008). Towards Large Scale Vocabulary Independent Spoken Term Detection: Advances
 in the Fraunhofer IAIS Audiomining System. In Proc. SIGIR, Singapore.


   10                                                                          Information Technologies Institute
                                                                               Centre for Research and Technology Hellas
Person Detection
                                                                              www.linkedtv.eu




 Face clustering using the face.com api  Speaker Identification using a GMM-
 Result: generally very good, some      HMM model, with 253 German parliament
erroneous clusters due to side-view      speakers
                                          Result: 8.0% Equal Error Rate
 11                                           Information Technologies Institute
                                              Centre for Research and Technology Hellas
Conclusions
                                                                                         www.linkedtv.eu



    We have established:
       all the different video analysis techniques
       their exact functionality
       the connections among them


    Preliminary results work as a solid ground for further improvements


    Many challenges have been addressed but several aspects of the analysis
     techniques show much room for improvement, e.g.,
       over-sensitivity of spatiotemporal segmentation algorithm to gradual transitions
        and camera’s movement
       adaptation of several TRECVID concepts to the needs of each specific multimedia
        content (news show, documentary, art show)
       over-sensitivity of speech recognizer to localized dialects and background noise



12                                                       Information Technologies Institute
                                                         Centre for Research and Technology Hellas
Future Plans
                                                                                www.linkedtv.eu



  Incorporate new methods:
       Near-duplicate Content Detection
          Goal: find parts that are already watched
       Optical Character Recognition
          Goal: exploit banner information to obtain a database for face and
            speaker recognition
       Topic Segmentation
          Goal: improve scene segmentation

  Find synergies between methods:
       ASR + Speaker Recognition + Face Detection
           Person Detection
       ASR + Topic Classification + Shot Segmentation
           Story Segmentation
       Concept Detection + Keywords Extraction + Topic Segmentation
           Video Similarity/Clustering



13                                              Information Technologies Institute
                                                Centre for Research and Technology Hellas
www.linkedtv.eu




                         Questions ?



More information:
http://www.iti.gr/~bmezaris
bmezaris@iti.gr

http://www.linkedtv.eu

14                                 Information Technologies Institute
                                   Centre for Research and Technology Hellas

Enrichment of News Show Videos with Multimodal Semi-Automatic Analysis

  • 1.
    Television Linked ToThe Web Daniel Stein1, Evlampios Apostolidis2, Vasileios Mezaris2, Nicolas de Abreu Pereira3, Jennifer Müller3, Mathilde Sahuguet4, Benoit Huet4, Ivo Lašek5 Enrichment of News Show Videos with Multimodal Semi-Automatic Analysis 1 Fraunhofer Institute IAIS, Schloss Birlinghoven, Germany 2 Information Technologies Institute, CERTH, Greece 3 rbb - Rundfunk Berlin-Brandenburg, 14482 Potsdam, Germany 4 Eurecom, Sophia Antpolis, France 5 Czech Technical University and University of Economics, Prague, Czech Republic NEM Summit, Istanbul, www.linkedtv.eu October 2012
  • 2.
    Synopsis www.linkedtv.eu  Introduction: LinkedTV Project  Use Cases  Intelligent Video Analysis  Results  Conclusions & future plans 2 Information Technologies Institute Centre for Research and Technology Hellas
  • 3.
    LinkedTV ― TelevisionLinked To the Web www.linkedtv.eu  Vision: 12 Excellent Partners  hypervideo Fraunhofer Eurecom  ubiquitously online cloud of STI GMBH Condat Networked Audio-Visual Content CERTH BEELD EN GELUID  decoupled from place, device or UEP Noterik source UMONS U. ST GALLEN CWI RBB  Aim:  provide interactive multimedia service for non-professional end- users  Focus on television broadcast content as seed videos  Web: http://www.linkedtv.eu 3 Information Technologies Institute Centre for Research and Technology Hellas
  • 4.
    LinkedTV Workflow www.linkedtv.eu Overall Architecture Use Case Scenarios Intelligent Video Analysis Linking Hypervideo to Web Content Contextualization and Personalization Interface and Presentation Engine 4 Information Technologies Institute Centre for Research and Technology Hellas
  • 5.
    LinkedTV Workflow www.linkedtv.eu Overall Architecture Use Case Scenarios Intelligent Video Analysis Linking Hypervideo to Web Content Contextualization and Personalization Interface and Presentation Engine 5 Information Technologies Institute Centre for Research and Technology Hellas
  • 6.
    Two Use CaseScenarios in LinkedTV www.linkedtv.eu Scenario 1 (this talk): Interactive News Show  Professional news  Due to legal constraints: whitelist  Detailed scenario archetype description content produced by RBB News topic,  Seed content: local people, news show "rbb locations, Aktuell" objects etc Scenario 2 (not covered here): Hyperlinked Documentary  Cultural content from S&V (1700 hours of cultural heritage AV- content under CCL)  Seed content: "Antique Roadshow" 6 6 Information Technologies Institute Centre for Research and Technology Hellas
  • 7.
    Intelligent Video Analysis www.linkedtv.eu 7 Information Technologies Institute Centre for Research and Technology Hellas
  • 8.
    Segmentation www.linkedtv.eu  Shot segmentation technique  Spatio-temporal Segmentation  [Tsamoura et. al., 2008]  [Mezaris et. al., 2004]  News show video performance:  News show performance: Good “remarkably well”  False positives due to:  Out of 269 shots detected:  Camera movement or zoom in/out (~ 55 %)  2 had wrong starting points  Gradual transition between frames (~ 10 %)  Erroneous motion vectors (~ 35 %)  4 contained multiple shots  11 were too short to evaluate  Unwanted effect: false recognition of properly moving banners which do not yield additional information V. Mezaris, I. Kompatsiaris, N. V. Boulgouris, and M. G. Strintzis, "Real-time compressed-domain spatiotemporal segmentation and ontologies for video indexing and retrieval", IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 606-621, May 2004. E. Tsamoura, V. Mezaris, I. Kompatsiaris, "Gradual transition detection using color coherence and other criteria in a video shot meta- segmentation framework", IEEE International Conference on Image Processing, Workshop on Multimedia Information Retrieval (ICIP-MIR 2008), San Diego, CA, USA, October 2008, pp. 45-48. 8 Information Technologies Institute Centre for Research and Technology Hellas
  • 9.
    Concept Detection www.linkedtv.eu  Method was described in [Moumtzidou et. al., 2011]  346 concepts from TRECVID 2011 SIN task  Overall performance:  Correctly detected concepts > 64 %  About 25 % of them are characterized as particularly useful mostly related to detecting persons (e.g., person, face, adult)  Erroneous concepts vary between 22% - 42% and in many cases achieve high scores (e.g., outdoor, amateur video) Visit: http://mklab.iti.gr/eventdetection-linkedtv/ A. Moumtzidou, P. Sidiropoulos, S. Vrochidis, N. Gkalelis, S. Nikolopoulos, V. Mezaris, I. Kompatsiaris, I. Patras, "ITI-CERTH participation to TRECVID 2011", Proc. TRECVID 2011 Workshop, December 2011, Gaithersburg, MD, USA. 9 Information Technologies Institute Centre for Research and Technology Hellas
  • 10.
    Automatic Speech Recognition www.linkedtv.eu  Automatic speech recognition for German (using [Schneider08]): segment of one news show WER notes new airport 36.2 outdoor, spontaneous soccer riot 44.2 tavern, dialect, background noise various news I 9.5 murder case 24.0 boxing 50.6 dialect, very spontaneous various news II 20.9 rbb game 39.1 weather report 46.7 spontaneous, casual  main obstacles: local dialect, spontaneous speech, background noise Schneider, D., Schon, J., and Eickeler, S. (2008). Towards Large Scale Vocabulary Independent Spoken Term Detection: Advances in the Fraunhofer IAIS Audiomining System. In Proc. SIGIR, Singapore. 10 Information Technologies Institute Centre for Research and Technology Hellas
  • 11.
    Person Detection www.linkedtv.eu  Face clustering using the face.com api  Speaker Identification using a GMM-  Result: generally very good, some HMM model, with 253 German parliament erroneous clusters due to side-view speakers  Result: 8.0% Equal Error Rate 11 Information Technologies Institute Centre for Research and Technology Hellas
  • 12.
    Conclusions www.linkedtv.eu  We have established:  all the different video analysis techniques  their exact functionality  the connections among them  Preliminary results work as a solid ground for further improvements  Many challenges have been addressed but several aspects of the analysis techniques show much room for improvement, e.g.,  over-sensitivity of spatiotemporal segmentation algorithm to gradual transitions and camera’s movement  adaptation of several TRECVID concepts to the needs of each specific multimedia content (news show, documentary, art show)  over-sensitivity of speech recognizer to localized dialects and background noise 12 Information Technologies Institute Centre for Research and Technology Hellas
  • 13.
    Future Plans www.linkedtv.eu  Incorporate new methods:  Near-duplicate Content Detection  Goal: find parts that are already watched  Optical Character Recognition  Goal: exploit banner information to obtain a database for face and speaker recognition  Topic Segmentation  Goal: improve scene segmentation  Find synergies between methods:  ASR + Speaker Recognition + Face Detection  Person Detection  ASR + Topic Classification + Shot Segmentation  Story Segmentation  Concept Detection + Keywords Extraction + Topic Segmentation  Video Similarity/Clustering 13 Information Technologies Institute Centre for Research and Technology Hellas
  • 14.
    www.linkedtv.eu Questions ? More information: http://www.iti.gr/~bmezaris bmezaris@iti.gr http://www.linkedtv.eu 14 Information Technologies Institute Centre for Research and Technology Hellas