AIEMpro 2010: CONTENTUS: Technologies for Next Generation Multimedia Libraries

AIEMpro 2010 keynote speech by
Andreas Heß, German National Library
Jan Nandzik, Acosta Consult

Published in: Technology

  1. CONTENTUS: Technologies for Next Generation Multimedia Libraries. AIEMPro’10, Firenze. Andreas Heß, German National Library; Jan Nandzik, Acosta Consult
  5. Motivation
       • National libraries contain millions of printed media
       • World film production hit 5,039 feature films in 2007 and represented 60% of the audio-visual revenues
       • More than 30 million hours of audio-visual content are stored in European archives
  6. Next Generation Multimedia Archives (ca. 2015)
       » Everything is digital
           » Mass processing, administration and provision of multimedia content is day-to-day business
           » All media are available in high quality
       » Always accessible
           » Access from anywhere, at any time
           » Resources are not added to knowledge networks; they are created within them
       » Always up-to-date
           » Desired information finds the user
       » Knowledge Journeys
           » An interactive exploration of cultural and scientific collections
           » Largely replace traditional access (e.g. search engines)
  7. Present Generation Multimedia Archives (ca. 2010)
       » Still large amounts of non-digitised material
           » Mass digitisation is still expensive, if quality is important
           » Problem: media deterioration
       » Restricted access
           » Only from reading room, even if digital (legal reasons...)
       » Not necessarily up-to-date
           » Indexing requires manual and intellectual effort
       » Search is the paradigm
           » User must know what he/she is looking for
           » Different search engines for different collections
           » Media discontinuity
           » Situation in present libraries/archives: search engines often slow and outdated
  8. Problems on the Way
       • Deterioration
       • Digitisation
       • Metadata
       • Accessibility
  9. Deterioration
       • Improper storage and handling
       • Magnetic tape: drop-outs, wow and flutter, ...
       • Film: dirt, scratches, blotches, ...
       • Paper: bleaching, acid, mice, ...
       • Optical discs: coating decay, ...
  10. Digitisation
       • Often lack of quality awareness
       • Quality is crucial for preserving cultural heritage
  11. Digitisation - Problems
       • Causes for quality issues:
           ‣ Unsuitable hardware
           ‣ Unsuitable configuration
           ‣ Errors during digitisation
       • Goals:
           ‣ Automation and efficiency
           ‣ Continuous checks while the job is being processed
  12. Metadata - Problems
       • Not always present
       • Indexing and annotation are time-consuming and error-prone
       • Incompatibility of different sources of metadata
  13. Accessibility - Problems
       • Current search approaches are not really suitable for multimedia content
       • Search and consumption are separated
       • Physical presence of media
       • Data is nothing without metadata!
  17. Our Project
       • THESEUS
           ‣ Research initiative in the area of Internet-based technologies, focus: semantic technologies
           ‣ Funded by the German Federal Ministry of Economics and Technology
           ‣ Consortium of approx. 60 partners from academia and industry
           ‣ "Application scenario" and "basic technology" subprojects
       • CONTENTUS
           ‣ Application scenario subproject of THESEUS
           ‣ Concepts and technologies for multimedia libraries and archives
  18. The CONTENTUS Processing Chain (figure labels: Media-specific, Media-independent)
  19. Outline
       • Media-specific processing
           ‣ Print
           ‣ Video
           ‣ Audio
       • Media-independent processing
           ‣ Entity recognition and disambiguation
           ‣ Semantic Linking
           ‣ Semantic Multi-Media Search
  20. Print - Quality Assessment
  22. Content-aware optimisation (figure panels: Original; Otsu Binarisation; Sauvola Binarisation; CONTENTUS Approach + Content-specific optimisation)
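For readers unfamiliar with the Otsu baseline named on this slide, the following is a minimal pure-Python sketch of Otsu's global threshold: it picks the grey level that maximises the between-class variance of the histogram. The function names and the toy pixel values are illustrative assumptions; the CONTENTUS content-specific optimisation itself is not described in the slides and is not reproduced here.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises between-class variance.

    Pixels with value <= t form one class, all other pixels the second class.
    `pixels` is a flat list of integer grey values in range(levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of grey values of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # first class still empty
        w1 = total - w0
        if w1 == 0:
            break             # second class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarise(pixels, t):
    """Map grey values to 0 (dark, e.g. ink) or 255 (light, e.g. paper)."""
    return [0 if p <= t else 255 for p in pixels]


# Toy "scan": dark ink around 10-12, bright paper around 200-210 (made-up values).
scan = [10] * 50 + [12] * 50 + [200] * 40 + [210] * 60
t = otsu_threshold(scan)
binary = binarise(scan, t)
```

A single global threshold like this is the reason uneven lighting or stained paper causes trouble; locally adaptive methods such as Sauvola's compute the threshold per neighbourhood instead.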
  23. De-Warping
  24. Removal of Unwanted Objects
  25. Page Segmentation
       • Automatic identification of:
           ‣ Articles
           ‣ Headings
           ‣ Tables of contents
           ‣ Figures / pictures and captions
           ‣ Bullet points
  26. Video - Quality Assessment
       • Quality survey essential to determine value of content
       • Use perceptual no-reference quality metrics
       • Check for specific image artifacts
       • Based on human visual models
       • Use specific restoration modules
      Restoration
       • Current solutions require at least 4 hours of work per hour of video
       • Automation necessary
  28. Video - Restoration
       • Scratches are automatically identified
       • Scratches are automatically removed
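The slides do not say how scratches are found or removed. As one simple illustration of the idea, the sketch below assumes vertical line scratches on a greyscale frame: a column whose mean brightness deviates strongly from the median of its neighbours is flagged, then filled in from the nearest unflagged columns. The function names, the window size and the threshold are assumptions chosen for this example, not the CONTENTUS method.

```python
from statistics import median


def column_means(frame):
    """Mean grey value of every column; `frame` is a list of equal-length rows."""
    height = len(frame)
    return [sum(row[x] for row in frame) / height for x in range(len(frame[0]))]


def detect_scratch_columns(frame, window=5, threshold=30.0):
    """Flag columns whose mean differs from the local median by more than `threshold`."""
    means = column_means(frame)
    half = window // 2
    flagged = []
    for x, m in enumerate(means):
        lo, hi = max(0, x - half), min(len(means), x + half + 1)
        if abs(m - median(means[lo:hi])) > threshold:
            flagged.append(x)
    return flagged


def repair_columns(frame, columns):
    """Replace flagged columns by the average of the nearest unflagged neighbours."""
    bad = set(columns)
    width = len(frame[0])
    repaired = [list(row) for row in frame]
    for x in sorted(bad):
        left = next((i for i in range(x - 1, -1, -1) if i not in bad), None)
        right = next((i for i in range(x + 1, width) if i not in bad), None)
        neighbours = [i for i in (left, right) if i is not None]
        if not neighbours:
            continue  # every column is flagged; nothing to interpolate from
        for row_in, row_out in zip(frame, repaired):
            row_out[x] = sum(row_in[i] for i in neighbours) / len(neighbours)
    return repaired


# Toy frame: uniform grey (100) with one bright vertical scratch in column 4.
frame = [[250 if x == 4 else 100 for x in range(9)] for _ in range(4)]
scratches = detect_scratch_columns(frame)
clean = repair_columns(frame, scratches)
```

Real film material would also need temporal information (a scratch persists across frames while picture content moves), which this single-frame sketch ignores.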
  29. Video - Segmentation (figure: news show "Tagesthemen", 22:35:29, split into segments labelled Speaker, Report 1, Report 2, Report, Interview, Sum.)
  30. Video - Annotation
       • Face detection and annotation
       • Prerequisite for further indexing
       • Text and logo detection
      (figure labels: tagesthemen, 1st version, Face detection, Logo detection, Text detection, OCR: Ulrich Wickert)
  31. Audio
       • Segmentation speech / non-speech
       • Extraction of musical features
       • Speech transcription
       • Speaker recognition
       • Similarity search
  32. Media-Independent Processing
  33. Named Entity Disambiguation
       • Named entities are extracted from text
       • Context is used for disambiguation
       • Compare to reference text, e.g. from Wikipedia
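The comparison of a mention's context with reference texts can be sketched as follows. This is a minimal illustration that assumes a plain bag-of-words cosine similarity as the comparison measure; the slides do not state which measure CONTENTUS uses. The candidate identifiers and the one-sentence reference texts are hand-written stand-ins, not actual Wikipedia content.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(context, candidates):
    """Return the candidate id whose reference text best matches the context,
    together with all similarity scores."""
    ctx = bag_of_words(context)
    scores = {cid: cosine(ctx, bag_of_words(ref)) for cid, ref in candidates.items()}
    return max(scores, key=scores.get), scores


# Hypothetical candidates for the ambiguous mention "Jaguar".
candidates = {
    "jaguar_animal": "The jaguar is a large cat species native to the Americas",
    "jaguar_car": "Jaguar is a British manufacturer of luxury cars",
}
context = "The jaguar hunts in the rainforest, a large cat native to South America"
best, scores = disambiguate(context, candidates)
```

In practice one would at least remove stop words and weight terms (for example with TF-IDF), since common words such as "the" otherwise contribute to the score.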
  34. Semantic Linking (figure labels: Authority Files, Wikipedia, MusicBrainz)
  35. Semantic Multimedia Search
      provides
       • Seamless searching in multimedia
       • Query expansion / narrowing
      combines
       • Full-text search and Semantic Web Stack (RDF-based ontology)
      integrates
       • Multiple media (Video, Audio, Text)
       • Clickable filter facets
       • Text classification
       • Similarity search
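How full-text search, query expansion and filter facets can work together is sketched below on a toy in-memory collection. The triples stand in for an RDF store and the documents for an index; all records, field names and function names are hypothetical and chosen only for illustration. The one fact taken from the slides is that Hanns Eisler composed the music of the GDR national anthem.

```python
from collections import Counter

# (subject, predicate, object) triples standing in for an RDF knowledge base.
TRIPLES = [
    ("Hanns Eisler", "composed", "GDR national anthem"),
]

# Hypothetical indexed items from different media types.
DOCS = [
    {"id": 1, "media": "Text", "text": "Press clipping about the composer Hanns Eisler"},
    {"id": 2, "media": "Audio", "text": "Recording of the GDR national anthem"},
    {"id": 3, "media": "Video", "text": "News broadcast on a film premiere"},
    {"id": 4, "media": "Audio", "text": "Interview with Hanns Eisler"},
]


def expand_query(term, triples=TRIPLES):
    """Return the term plus every entity directly linked to it by a triple."""
    terms = {term}
    for subject, _predicate, obj in triples:
        if subject == term:
            terms.add(obj)
        elif obj == term:
            terms.add(subject)
    return terms


def search(term, docs=DOCS, triples=TRIPLES, media=None):
    """Full-text match on the expanded query; return (hits, facet counts).

    Facet counts are computed before the optional media filter is applied,
    so a user interface could still show the other media types as clickable options.
    """
    terms = [t.lower() for t in expand_query(term, triples)]
    hits = [d for d in docs if any(t in d["text"].lower() for t in terms)]
    facets = Counter(d["media"] for d in hits)
    if media is not None:
        hits = [d for d in hits if d["media"] == media]
    return hits, facets


all_hits, facets = search("Hanns Eisler")
audio_hits, _ = search("Hanns Eisler", media="Audio")
```

Without the expansion step, the anthem recording (id 2) would not be found, because its text never mentions the composer's name.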
  36. The CONTENTUS Collection
       • Digitisation of the Music Information Center of the GDR
       • Now a special collection of the German Music Archive
           ‣ 1,600 books
           ‣ 200,000 press clippings
           ‣ 4,000 audio records
           ‣ 10,000 photos
       • Other media
           ‣ Newspapers (Neues Deutschland)
           ‣ News broadcasts
           ‣ Historical film material
  37. Demonstration
  38. SMMS Demo 1: Who was Hanns Eisler?
       • Hanns Eisler, GDR composer
       • Composed the music of the GDR national anthem
      SMMS features:
       – Multimedia
       – One unified index
       – Faceted search
       – Roles of persons
       – Timeline
  39. Summary
       • Challenges and obstacles on the way to the Next Generation Multimedia Library
       • Our project
           ‣ High degree of automation throughout the complete processing chain
           ‣ Semantic multimedia search engine
  40. Conclusion
       • The next generation library is not here yet
       • We're on the way...
       • We need your help!
  41. Thank you for your attention
       • Visit THESEUS: http://www.theseus-programm.de/
       • Mail us: a.hess@d-nb.de, jn@acosta-consult.de
