IASA Presentation


Presentation for the 40th anniversary IASA Congress in Athens


  • The dense descriptions, generally one per hour of audio, mean that when results are found for a given query, the user is pointed to a large chunk of audio to explore.
  • The undisclosed part of the collection cannot be accessed, and its content is largely unknown.
  • For ‘disclosure’ the speech technology researchers want “to automatically generate a time-stamped content description”. The automation reduces the human annotation effort, and because the annotations are time-stamped, words are linked to locations in the audio recording, allowing fragments to be retrieved in addition to entire audiovisual documents. The technology used for disclosure depends on (1) the available metadata, and (2) the availability of context documents, i.e. documents that are directly related either to the recording or to its topic. When a transcript of the recording is available, the words in the transcript can be aligned to the audio. During this process the locations of the known words are determined in the audio signal. The result is a fairly accurate index of which word was said where in the audio. When it is unknown exactly what was said in the recording, ASR can be used to generate hypotheses of what was said where. Context documents can be valuable here to improve the models used for speech recognition. Speech recognizers generate output that is generally not error-free, but up to word error rates of 30 to 40% -- that is, 3 or 4 out of every 10 words recognized incorrectly -- the automatically generated content descriptions may still be used successfully as search indexes. This is explained by the fact that speech is redundant, i.e. when something is on-topic it will be referred to more than once, and that many of the words with a high risk of being mis-recognized make a relatively small contribution to the information content, e.g. prepositions (in, at), determiners (a, the), etc.
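The word error rate mentioned here is the word-level edit distance between the reference transcript and the recognizer output, divided by the reference length. A minimal sketch in Python (the example sentences are invented for illustration, not taken from the archive):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# One substitution ("on" -> "in") and one deletion ("mat") in 6 words:
print(word_error_rate("the cat sat on the mat",
                      "the cat sat in the"))  # ~0.33
```

A 30-40% WER thus means the returned value would lie between 0.3 and 0.4.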
  • How does CHoral technology fit into the archiving workflow? This of course is a simplified representation, but it gives a general idea. After content has been produced, it is transferred to the archives for preservation. The data are being stored, archivists index the collection, and users may search the index for recordings of possible interest. <start animation> CHoral uses the recordings and the existing metadata to give the user a new kind of access. In addition to searching the catalogue for recordings that can be listened to at the archive’s listening room, search results come with audio fragments that can be listened to online, e.g., from the searcher’s home or work location. <animation 2> The technology consists of automatic speech recognition for index generation, information retrieval technology for finding relevant audio fragments in the collection, and of new user interface components that support interaction with the audio fragments.
  • During alignment the locations of known words are determined in the speech signal. By matching the acoustics in the speech signal to the expected acoustics of individual words, each word in the transcript is matched to the location in the audio where it is most likely to occur. This results in an index that gives exact word positions for each word in the transcript. The accuracy of the resulting index is very high.
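Such an alignment can be stored as simple (begin, end, word) triples, as in the alignment slide later in this deck. A minimal sketch of turning that output into a searchable time index, assuming the begin/end numbers are audio sample offsets at 16 kHz (the actual unit used in the project is not stated here):

```python
SAMPLE_RATE = 16_000  # assumed sampling rate; the real unit may differ

# Alignment output as on the slide: (begin, end, word)
alignment = [
    (0, 54400, "-silence-"),
    (54400, 65280, "Landgenooten"),
    (65280, 69120, "Waar"),
    (69120, 73600, "Ik"),
    (73600, 79520, "Enkele"),
]

def index_from_alignment(alignment, sample_rate=SAMPLE_RATE):
    """Map each word (lowercased) to its (start, end) time in seconds."""
    return {
        word.lower(): (begin / sample_rate, end / sample_rate)
        for begin, end, word in alignment
        if word != "-silence-"  # non-speech regions are not indexed
    }

idx = index_from_alignment(alignment)
print(idx["landgenooten"])  # (3.4, 4.08)
```

A query hit on a word can then jump straight to the matching fragment instead of returning the whole recording.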
  • The following type of speech recognition system is used. Before the actual recognition process starts, some pre-processing is done: the audio document is classified into speech and non-speech segments, so that the parts of the recording that do not contain speech (e.g., music, street noise) are not fed to the recognition system. Moreover, the speech may be segmented into coherent chunks per speaker, so that models may be adapted to individual speakers. The speech recognition system itself consists of three components: (1) an acoustic model that models the different speech sounds of a language, (2) a language model that models which sequences of words are likely, and (3) a dictionary that prescribes which speech sounds each word is made up of. To develop an acoustic model, over 50 hours of annotated speech materials are needed. To develop a language model, texts of hundreds of millions of words are used. The output of the ASR system is a word-level index, or a hypothesis of which words were spoken where in the audio document. Instead of running the recognition process just once, the output of the first round may be used to better choose the models used during recognition. Therefore, a so-called second pass is often run with adapted models to arrive at a more accurate index.
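The two-pass workflow described in this note can be sketched as follows; the segmenter, recognizer and adaptation functions are hypothetical stand-ins, not the project's actual components:

```python
def two_pass_recognition(audio, segment, recognize, adapt):
    """Sketch of the two-pass ASR workflow: pre-process, recognize,
    adapt the models on the first-pass output, then recognize again."""
    # Pre-processing: keep only the speech segments
    speech = [s for s in segment(audio) if s["is_speech"]]
    # First pass with the generic (unadapted) models
    first_pass = [recognize(s["samples"], models=None) for s in speech]
    # Use the first-pass hypotheses to adapt the models
    adapted = adapt(first_pass)
    # Second pass with adapted models yields the final word-level index
    return [recognize(s["samples"], models=adapted) for s in speech]

# Toy stand-ins so the sketch runs end to end:
segment = lambda audio: [{"is_speech": True, "samples": audio},
                         {"is_speech": False, "samples": b""}]
recognize = lambda samples, models: ("adapted " if models else "") + "hypothesis"
adapt = lambda hyps: "speaker-adapted models"

print(two_pass_recognition(b"...", segment, recognize, adapt))
# ['adapted hypothesis']
```

The non-speech segment is dropped in pre-processing, and only the second pass, run with the adapted models, produces the final hypotheses.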
  • The output of an ASR system can take several forms. The most well-known form of output is in sentences, reflecting the most likely word sequence that was recognized by the system. For indexing purposes, however, other output types should be considered. One candidate is the lattice structure, which stores not only the most likely word sequence for a certain fragment of audio, but also alternative words that are likely at certain positions. In this way, alternatives are kept available.
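A lattice can be represented minimally as a sequence of time segments, each carrying several weighted word hypotheses; a sketch (the words and probabilities below are invented for illustration):

```python
# Each segment spans a time interval and carries alternative word
# hypotheses with probabilities (values here are invented).
lattice = [
    {"start": 0.0, "end": 0.3, "hyps": [("er", 0.6), ("'t", 0.4)]},
    {"start": 0.3, "end": 0.5, "hyps": [("is", 0.9), ("in", 0.1)]},
]

def one_best(lattice):
    """The most likely word sequence: what sentence output would show."""
    return [max(seg["hyps"], key=lambda h: h[1])[0] for seg in lattice]

def index_terms(lattice):
    """All alternatives: what a lattice-based search index would keep."""
    return {word for seg in lattice for word, _ in seg["hyps"]}

print(one_best(lattice))                  # ['er', 'is']
print(sorted(index_terms(lattice)))
```

A query for a word that lost out in the 1-best sequence (here “'t” or “in”) can still hit the document when the lattice alternatives are indexed.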
  • For successful take-up of the technology, some investments are needed. Thanks to the ongoing digitization process as well as the standardization of formats, audio documents should increasingly be fit for automatic processing without further adaptations. The quality of automatic annotations depends on the quality of the ASR models, and those can be tuned to different domains using accurate transcriptions of representative samples and/or (large amounts of) text data on the same or a strongly related topic. But when an ASR system is used to automatically generate time-stamped content descriptions, should those descriptions be validated by archivists? And if so, how?
  • A surrogate is a textual or visual representation of the content of a spoken-word document that searchers can use to assess a document's contents before deciding to listen to the audio.

    1. 1. Hidden treasures lost forever? Speech technology for the disclosure of Dutch audiovisual archives Mies Langelaar and Willemijn Heeren
    2. 2. Contents <ul><li>Introduction & Problem statement </li></ul><ul><li>Digitization/standardization in the E-repository </li></ul><ul><li>Speech technology for AV archives </li></ul><ul><li>System demonstration </li></ul>
    3. 3. Introduction <ul><li>Hidden treasures of audiovisual archives lost forever? </li></ul><ul><li>Backlog </li></ul><ul><ul><li>Data stored on deteriorating analogue carriers </li></ul></ul><ul><ul><li>Digitized and digital-born data in non-standardized formats </li></ul></ul><ul><li>→ Digitization and international standardization needed </li></ul><ul><li>Often global level of description </li></ul><ul><ul><li>A few keywords per data unit (hour, tape, interview) </li></ul></ul><ul><ul><li>Often no content description at all, because annotation is very (time-)costly </li></ul></ul><ul><li>→ Reduce human effort through use of speech technology? </li></ul>
    4. 4. The approach <ul><li>NWO CATCH project CHoral (2006-2010) </li></ul><ul><li>Goal: </li></ul><ul><li>investigate and develop automatic annotation and search technology for spoken word archives </li></ul><ul><li>Cooperation between </li></ul><ul><li>speech technology researchers, University of Twente </li></ul><ul><li>archivists, Rotterdam Municipal Archives </li></ul>
    5. 5. The test case <ul><li>‘Radio Rijnmond’ (RR) archives </li></ul><ul><ul><li>city of Rotterdam's regional radio channel </li></ul></ul><ul><ul><li>initial broadcast in 1983 </li></ul></ul><ul><ul><li>broadcast recordings, amounting to over 60,000 hours </li></ul></ul><ul><ul><li>partially digitized, mostly analog </li></ul></ul><ul><ul><li>partially disclosed, mostly waiting for annotation </li></ul></ul><ul><ul><li>typical of A/V archives in cultural heritage (CH) </li></ul></ul>
    6. 6. Searching the RR archives I Minimal content descriptions per hour of data
    7. 7. Searching the RR archives II ? ? ? ? ?
    8. 8. Main problems <ul><li>The main problems with this example collection are: </li></ul><ul><li>1. a large backlog of undisclosed material → data are inaccessible for third parties </li></ul><ul><li>2. fairly unspecific annotations, if available </li></ul><ul><li>→ restricted use for answering information needs </li></ul><ul><li>3. audio is being kept on analog data carriers or on CDs </li></ul><ul><li>→ interactive or online search cannot be supported </li></ul>
    9. 9. Towards solutions … <ul><li>Digitization/standardization in the E-repository </li></ul><ul><li>Speech technology for AV archives </li></ul>
    10. 10. <ul><li>Digitization/standardization </li></ul><ul><li>in the E-repository </li></ul>
    11. 11. AV Collection of Rotterdam Municipal Archives <ul><li>About 15,000 AV objects in collection </li></ul><ul><li>Most of this collection is on analog data carriers </li></ul><ul><li>Part of the collection is on CDs, dating from the 1980s onwards </li></ul><ul><li>No standardisation in storage formats </li></ul><ul><li>No or minimal metadata and description of content available </li></ul>
    12. 12. Work in progress <ul><li>Digitisation of the analogue audio material is done in-house </li></ul><ul><li>Standard formats that are used are: </li></ul><ul><ul><li>.WAV for uncompressed PCM audio </li></ul></ul><ul><ul><li>44.1 kHz 16-bit stereo for audio CDs that are already digitised, but need preservation </li></ul></ul><ul><ul><li>48 kHz 24-bit stereo for old recordings </li></ul></ul><ul><ul><li>Digitally produced audio is accepted in its own format </li></ul></ul><ul><li>Access to the objects is granted by audio CD or MP3 </li></ul>
    13. 13. Work in progress (2) <ul><li>Digitisation of video and film is done partly in-house, partly by external partners </li></ul><ul><li>The standards that are used are: </li></ul><ul><ul><li>Minimal data rate of 50 Mbit/s for preservation purposes </li></ul></ul><ul><ul><li>Digital Betacam for VHS and Umatic tapes </li></ul></ul><ul><ul><li>Digital video is accepted in its original recording format (miniDV, DVCam, XDCam, etc.) </li></ul></ul><ul><ul><li>Digibeta for 8mm, 16mm and 35mm film (processed by external partners) </li></ul></ul><ul><ul><li>DV25 for 16mm film (processed in-house) </li></ul></ul><ul><li>Digital Betacam is stored as 10-bit uncompressed </li></ul>
    14. 14. How to ensure long term sustainability <ul><li>Set up a trusted digital repository, consisting of hardware, software, procedures, methods, knowledge and experience </li></ul>
    15. 15. Trusted Digital Repository (architecture diagram) -- components: Feeder System, Workflow, Workflow Controller, Job Queue, Ingest Toolkit, File Storage, Storage Adaptor, Metadata Store, Data Management, Characterisation, Technical Registry, Preservation Planning, Preservation Controller, Active Preservation, Passive Preservation, Migration, Access, Reporting; roles: User, Administrator, Archivist
    16. 16. How to ensure long term access to data? <ul><li>Adding a minimal set of metadata, necessary for management, preservation and access </li></ul><ul><li>Using standard archival formats </li></ul><ul><li>Making agreements with producers of AV material about acceptable formats </li></ul><ul><li>Disclosure of content through Automatic Speech Recognition (ASR) </li></ul>
    17. 17. <ul><li>Speech technology for AV archives </li></ul>
    18. 18. Disclosure through speech technology <ul><li>Disclosure: automatically generate a time-stamped content description </li></ul><ul><li>→ Allows online retrieval of fragments of AV records </li></ul><ul><li>Method depends on: </li></ul><ul><ul><li>Available metadata </li></ul></ul><ul><ul><li>Availability of context documents </li></ul></ul><ul><ul><li>When a transcript is available: </li></ul></ul><ul><ul><ul><li>Speech and transcript can be aligned, </li></ul></ul></ul><ul><li>i.e. automatically couple what was said in the transcript to where it was said in the audio </li></ul><ul><ul><li>When there is no transcript: </li></ul></ul><ul><ul><ul><li>Use automatic speech recognition to generate hypotheses of what was said where in the audio </li></ul></ul></ul><ul><ul><ul><li>Word Error Rates under 40% allow automatically generated content descriptions to be used as a search index </li></ul></ul></ul>
    19. 19. AV archiving workflow CHoral Content production Indexing <ul><li>Research topics </li></ul><ul><li>ASR : Automatic Indexing </li></ul><ul><li>IR: Information Retrieval </li></ul><ul><li>UI: User Interface Development </li></ul>End user ASR IR UI
    20. 20. Research <ul><li>Automatic indexing through speech technology: </li></ul><ul><ul><li>Development of robust automatic speech recognition and audio classification tools </li></ul></ul><ul><li>Information Retrieval: </li></ul><ul><ul><li>Retrieval of spoken documents based on ASR output </li></ul></ul><ul><ul><li>Bridging the semantic gap between user queries and spoken content </li></ul></ul><ul><li>User Interface development: </li></ul><ul><ul><li>Support search and browsing in audio documents </li></ul></ul><ul><ul><li>(Re)presentation of audio content </li></ul></ul>
    21. 21. Alignment: speech signal + typed transcript (“Landgenooten waar ik enkele …”) → time-stamped index:

       Begin frame #   End frame #   Word
       00000           54400         -silence-
       54400           65280         Landgenooten
       65280           69120         Waar
       69120           73600         Ik
       73600           79520         Enkele
       …               …             …
    22. 22. Automatic speech recognition (diagram): pre-processing (classification speech/non-speech, segmentation of speakers) → speech recognition using an acoustic model (trained on 50+ hours of audio), a language model (250-500 M words) and a pronunciation dictionary → 2nd recognition pass with adapted models → word-level index
    23. 23. Types of word level indexes <ul><li>Most probable words: </li></ul><ul><li>Lattice structures: </li></ul>ASR: Er is een bekend beeld voor veel ouders de grote show in onveilige situatie voor de school TXT: ‘t is een bekend beeld voor veel ouders. De chaotische en onveilige situatie voor de school “ D’66 is z’n ene zetel kwijt”
    24. 24. Discussion ASR <ul><li>For successful automatic annotation: </li></ul><ul><ul><li>Audio should be digitally available, preferably on a server </li></ul></ul><ul><ul><li>To optimize ASR models for high-quality output, </li></ul></ul><ul><ul><ul><li>part of the speech should be transcribed, </li></ul></ul></ul><ul><ul><ul><li>or related documents should be available </li></ul></ul></ul><ul><li>? How to validate automatic indexes? </li></ul>
    25. 25. User interface development <ul><li>Challenges </li></ul><ul><li>Understand users’ requirements and information needs </li></ul><ul><li>Support selection and browsing of spoken content </li></ul><ul><ul><li>Representation of spoken content via ‘surrogate’ </li></ul></ul><ul><li>Cross-linking to related content within the same or from another collection </li></ul><ul><li>IPR issues </li></ul>
    26. 26. CHoral speech technology for GAR <ul><li>Alignment: Brandgrens interviews Rotterdam </li></ul><ul><li>Speech recognition: RR archives </li></ul>
    27. 27. Discussion <ul><li>Development is ongoing both in the workflow and daily practice at audiovisual archives, and in speech technology </li></ul><ul><li>Careful tuning of processes is needed for mutual benefit </li></ul><ul><li>Examples demonstrate envisioned benefits: </li></ul><ul><ul><li>Potential reduction of human effort for annotation of undisclosed materials </li></ul></ul><ul><ul><li>Online access to fragments of spoken heritage </li></ul></ul>
    28. 28. <ul><li>For more information, see http://hmi.ewi.utwente.nl/project/CHoral </li></ul><ul><li>Questions? </li></ul>