How world-class product teams are winning in the AI era by CEO and Founder, P...
Enrichment of News Show Videos with Multimodal Semi-Automatic Analysis
1. Television Linked To The Web
Daniel Stein1, Evlampios Apostolidis2, Vasileios Mezaris2,
Nicolas de Abreu Pereira3, Jennifer Müller3,
Mathilde Sahuguet4, Benoit Huet4, Ivo Lašek5
Enrichment of News Show Videos with
Multimodal Semi-Automatic Analysis
1 Fraunhofer Institute IAIS, Schloss Birlinghoven, Germany
2 Information Technologies Institute, CERTH, Greece
3 rbb - Rundfunk Berlin-Brandenburg, 14482 Potsdam, Germany
4 Eurecom, Sophia Antpolis, France
5 Czech Technical University and University of Economics, Prague, Czech Republic
NEM Summit, Istanbul,
www.linkedtv.eu
October 2012
2. Synopsis
www.linkedtv.eu
Introduction: LinkedTV Project
Use Cases
Intelligent Video Analysis
Results
Conclusions & future plans
2 Information Technologies Institute
Centre for Research and Technology Hellas
3. LinkedTV ― Television Linked To the Web
www.linkedtv.eu
Vision: 12 Excellent Partners
hypervideo Fraunhofer Eurecom
ubiquitously online cloud of STI GMBH Condat
Networked Audio-Visual Content CERTH BEELD EN GELUID
decoupled from place, device or UEP Noterik
source UMONS U. ST GALLEN
CWI RBB
Aim:
provide interactive multimedia
service for non-professional end-
users
Focus on television broadcast
content as seed videos
Web: http://www.linkedtv.eu
3 Information Technologies Institute
Centre for Research and Technology Hellas
4. LinkedTV Workflow
www.linkedtv.eu
Overall Architecture
Use Case Scenarios
Intelligent Video Analysis
Linking Hypervideo to Web Content
Contextualization
and Personalization
Interface and Presentation Engine
4 Information Technologies Institute
Centre for Research and Technology Hellas
5. LinkedTV Workflow
www.linkedtv.eu
Overall Architecture
Use Case Scenarios
Intelligent Video Analysis
Linking Hypervideo to Web Content
Contextualization
and Personalization
Interface and Presentation Engine
5 Information Technologies Institute
Centre for Research and Technology Hellas
6. Two Use Case Scenarios in LinkedTV
www.linkedtv.eu
Scenario 1 (this talk):
Interactive News Show
Professional news
Due to legal constraints: whitelist
Detailed scenario archetype description
content produced by
RBB News topic,
Seed content: local people,
news show "rbb locations,
Aktuell" objects etc
Scenario 2
(not covered here):
Hyperlinked Documentary
Cultural content from
S&V (1700 hours of
cultural heritage AV-
content under CCL)
Seed content:
"Antique Roadshow"
6
6 Information Technologies Institute
Centre for Research and Technology Hellas
7. Intelligent Video Analysis
www.linkedtv.eu
7 Information Technologies Institute
Centre for Research and Technology Hellas
8. Segmentation
www.linkedtv.eu
Shot segmentation technique Spatio-temporal Segmentation
[Tsamoura et. al., 2008] [Mezaris et. al., 2004]
News show video performance: News show performance: Good
“remarkably well” False positives due to:
Out of 269 shots detected: Camera movement or zoom in/out (~ 55 %)
2 had wrong starting points Gradual transition between frames (~ 10 %)
Erroneous motion vectors (~ 35 %)
4 contained multiple shots
11 were too short to evaluate
Unwanted effect: false recognition of
properly moving banners which do not yield
additional information
V. Mezaris, I. Kompatsiaris, N. V. Boulgouris, and M. G. Strintzis, "Real-time compressed-domain spatiotemporal segmentation and ontologies
for video indexing and retrieval", IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 606-621, May 2004.
E. Tsamoura, V. Mezaris, I. Kompatsiaris, "Gradual transition detection using color coherence and other criteria in a video shot meta-
segmentation framework", IEEE International Conference on Image Processing, Workshop on Multimedia Information Retrieval (ICIP-MIR
2008), San Diego, CA, USA, October 2008, pp. 45-48.
8 Information Technologies Institute
Centre for Research and Technology Hellas
9. Concept Detection
www.linkedtv.eu
Method was described in [Moumtzidou et.
al., 2011]
346 concepts from TRECVID 2011 SIN task
Overall performance:
Correctly detected concepts > 64 %
About 25 % of them are characterized as
particularly useful mostly related to detecting
persons (e.g., person, face, adult)
Erroneous concepts vary between 22% -
42% and in many cases achieve high scores
(e.g., outdoor, amateur video)
Visit: http://mklab.iti.gr/eventdetection-linkedtv/
A. Moumtzidou, P. Sidiropoulos, S. Vrochidis, N. Gkalelis, S. Nikolopoulos, V. Mezaris, I. Kompatsiaris, I. Patras, "ITI-CERTH participation
to TRECVID 2011", Proc. TRECVID 2011 Workshop, December 2011, Gaithersburg, MD, USA.
9 Information Technologies Institute
Centre for Research and Technology Hellas
10. Automatic Speech Recognition
www.linkedtv.eu
Automatic speech recognition for German (using [Schneider08]):
segment of one news show WER notes
new airport 36.2 outdoor, spontaneous
soccer riot 44.2 tavern, dialect, background noise
various news I 9.5
murder case 24.0
boxing 50.6 dialect, very spontaneous
various news II 20.9
rbb game 39.1
weather report 46.7 spontaneous, casual
main obstacles: local dialect, spontaneous speech, background noise
Schneider, D., Schon, J., and Eickeler, S. (2008). Towards Large Scale Vocabulary Independent Spoken Term Detection: Advances
in the Fraunhofer IAIS Audiomining System. In Proc. SIGIR, Singapore.
10 Information Technologies Institute
Centre for Research and Technology Hellas
11. Person Detection
www.linkedtv.eu
Face clustering using the face.com api Speaker Identification using a GMM-
Result: generally very good, some HMM model, with 253 German parliament
erroneous clusters due to side-view speakers
Result: 8.0% Equal Error Rate
11 Information Technologies Institute
Centre for Research and Technology Hellas
12. Conclusions
www.linkedtv.eu
We have established:
all the different video analysis techniques
their exact functionality
the connections among them
Preliminary results work as a solid ground for further improvements
Many challenges have been addressed but several aspects of the analysis
techniques show much room for improvement, e.g.,
over-sensitivity of spatiotemporal segmentation algorithm to gradual transitions
and camera’s movement
adaptation of several TRECVID concepts to the needs of each specific multimedia
content (news show, documentary, art show)
over-sensitivity of speech recognizer to localized dialects and background noise
12 Information Technologies Institute
Centre for Research and Technology Hellas
13. Future Plans
www.linkedtv.eu
Incorporate new methods:
Near-duplicate Content Detection
Goal: find parts that are already watched
Optical Character Recognition
Goal: exploit banner information to obtain a database for face and
speaker recognition
Topic Segmentation
Goal: improve scene segmentation
Find synergies between methods:
ASR + Speaker Recognition + Face Detection
Person Detection
ASR + Topic Classification + Shot Segmentation
Story Segmentation
Concept Detection + Keywords Extraction + Topic Segmentation
Video Similarity/Clustering
13 Information Technologies Institute
Centre for Research and Technology Hellas
14. www.linkedtv.eu
Questions ?
More information:
http://www.iti.gr/~bmezaris
bmezaris@iti.gr
http://www.linkedtv.eu
14 Information Technologies Institute
Centre for Research and Technology Hellas