IMPACT Final Event 26-06-2012 - Franciska de Jong - Indexing and searching of ‘noisy’ data

Published in: Technology, Education

  1. Indexing and searching of noisy data
     Franciska de Jong
     University of Twente, cluster Human Media Interaction, Enschede, The Netherlands
     Erasmus University, Erasmus Studio for e-research, Rotterdam, The Netherlands
     http://hmi.ewi.utwente.nl/~fdejong
     IMPACT Closing Event - The Hague
  2. Overview
     Part I: Noisy data analysis – other examples
     Part II: Emerging scenarios of scholarly use
     Part III: From noisy (meta)data towards metadata mining
  3. Noisy Channel for Spelling Correction (J&M Figure 5.23)
     noise: limitations in spelling skills
  4. Noisy Channel for Speech Recognition (J&M Figure 9.2)
     noise: limitations in sound captured
  5. Noisy Channel for Machine Translation (J&M Figure 25.15)
     noise: loss of information through translation
  6. Noisy Channel for OCR (J&M Figure 5.23)
     noise: loss of information through typesetting/handwriting
  7. Decoding spoken audio
     • Audio modelling: collect data on the ground truth for audio segments
     • Language modelling: collect data on co-occurrences of words
     • 100 hours of speech
     • Text data (500 M words)
     There is no data like more data
  8. After decoding
     • multiple hypotheses with varying probabilities of being correct
     • selection from n-best list: errors unavoidable
     • post-editing can be an option, but never without extra costs
       – time (editors), money (editing platform)
       – complexity of workflow
  9. Impact of noise on access tasks
     • Content/metadata with a certain amount of errors
     • Search with reduced accuracy:
       – missed hits (false negatives)
       – incorrect hits (false positives; ‘noise’)
     • Noisy data less suited for presentation layer
       – pdf versus ascii
       – original audio versus transcript; alternatives: word clouds, related content
  10. Access to interviews: transcript generation
      [Diagram. Components: multimedia interview archive; metadata; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); search engine; query; result presentation; users: general public, archivists, researchers]
  11. Optimization Strategies (1)
      • Error correction: post-editing, better recognition
      • Improved recognition
        – typically effective for core collections (WER below 20%)
        – less effective for the long tail
      Case: interviews with Willem Frederik Hermans
      • With models for news: 81% WER
      • Aim: reduction to around 60%
  12. Optimization Strategies (2)
      • Dedicated/task-specific evaluation
        – for search applications, errors in function words are less critical than errors in e.g. names of persons and locations
      • Dedicated weighting schemes for search tasks
        – assign confidence scores to fragments found and rerank search results accordingly
  13. Access to interviews: support for users
      [Diagram: same architecture as on slide 10]
  14. Part II: Emerging scenarios of scholarly use
  15. DLs and knowledge discovery
      • Focus of attention for analysis is no longer the document alone.
      • Room for statistical methods to analyse entire collections, archives, libraries.
      • Tools that automatically detect and capture various semantic layers and feed the patterns found back into the metadata structures.
      • Discovery versus item finding: room for serendipity and data-driven content exploration.
  16. Paradigm evolution
      (paradigm: science examples | information studies examples)
      • Experimental work: direct observation | interpretation/decoding of texts
      • Theoretical modeling: E = mc², a² + b² = c² | S → NP VP, Principle of Compositionality
      • Computational modeling: change simulation | GIS for visualisation of mobility patterns
      • Data-intensive computing: particle physics, astronomy | text-mining: cross-document entity linking for cultural heritage libraries; rule-based parsing of large corpora (typology studies)
  17. More than search: metadata extraction
      • For large-scale digital (distributed) collections, the potential added value of automatically generated metadata is becoming more and more apparent.
      • Automatic content labeling:
        – not just a matter of speeding up the annotation process and enlarging the scope of analysis, but also
        – starting point for generating annotation layers at collection level, and
        – basis for link structures for all kinds of semantic aspects of content, such as chronological trends, topic shifts, style and authenticity.
        – potentially noisy
  18. “Multi”-issues for DL metadata (1)
      • Multi-layer
        – beyond tombstone: content description at fragment level (full text, full content, etc.)
        – free text annotation versus thesaurus-based labeling
      • Multiple media formats
        – text, text, text
        – spoken audio, video, still images, music, scores, numerical data, sensor data, census data, etc.
  19. Multi-issues for DL metadata (2)
      • Multiple perspectives
        – cover more than local context
        – cover more than one domain perspective
        – cover more than one language
      • Multiple values due to uncertainty
        – multiple human annotators
        – automatic labeling extracted from potentially noisy data
        – dynamics in collection/context
  20. Scholarly use
      • Comparative perspective
        – Quantitative and qualitative issues
      • Need for enhanced content presentation:
        – Multiple layers
        – Links to context
        – Links to related content
      • Emerging methodological shift
        – Enhanced collection exploration (think of Google n-grams)
  21. Part III: From noisy data/metadata towards metadata mining
  22. Metadata mining: crucial steps
      • Treat all annotation types (classical metadata, automatically extracted metadata, scholarly annotation, community tagging) as assets.
      • Learn how to integrate the various types and layers to enhance accessibility and to be able to exploit the knowledge captured in metadata
        – Exploiting manual annotation for machine learning training
        – Detection of collection-level semantic features
        – Innovative interface and interaction design
  23. What can metadata mining bring?
      • Quality added to metadata for increased accessibility of content:
        – structured search (full text + classification-based)
        – navigation across collections, rich presentation layers
      • Increased insight into relations between data collections (across media types, languages, etc.)
      • Increased understanding of knowledge production as captured by metadata and annotation processing
      • Support for capturing the essence of association and analogy.
      There is no data like metadata!
  24. Issues for metadata models
      Old
      • annotation interoperability (e.g., metadata integration for content annotated with coding tools such as thesauri and ontologies)
      New
      • how to capture fuzziness and uncertainty coming from multiple sources and/or statistical processing
      • coding of change over time (e.g., metadata for the dynamics of temporal and geo-spatial details)
  25. Issues for scholarly users
      Individual level
      • Learn to deal with imperfection
      • Understand the limitations of technological innovation
      Community level
      • Stay attuned to developers
      • Organize methodology teaching
      • Study emerging practices
      • Share success stories
  26. Issues for developers
      • Learn about scholarly practices
      • Stay attuned to users during the entire process
      • Organize structured feedback loops
      • Study best practices
      • Share responsibility for centers of expertise
  27. Issues for e-humanities
      • e-humanities is e-research
      • multiple media, multiple platforms
      • keep connecting!
  28. Contact
      • email: f.m.g.dejong@utwente.nl or fdejong@ese.eur.nl
      • url: http://hmi.ewi.utwente.nl/~fdejong
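
Slide 11 quotes word error rates (WER below 20% for core collections, 81% for the Hermans interviews with news models). The slides do not say how these were measured. The following is a minimal sketch, assuming the common definition of WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The function name and the example sentences are illustrative and do not come from the talk.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not from the talk): one substitution and one deletion
# against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the bat sat the mat")
```

Because insertions count as errors, WER under this definition can exceed 100% when the hypothesis contains many more words than the reference.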

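Slide 12 proposes assigning confidence scores to the fragments found and reranking search results accordingly, without giving a formula. The sketch below shows one way this could work, assuming each hit carries a retrieval score and per-word recognition confidences. The `Hit` class, the `rerank` function, the linear discount and the fragment identifiers are all hypothetical choices made for illustration, not the method used in the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    fragment_id: str               # hypothetical identifier of a transcript fragment
    retrieval_score: float         # score from the search engine (assumed in [0, 1])
    word_confidences: List[float]  # recogniser confidence per matched query word


def rerank(hits: List[Hit], weight: float = 0.5) -> List[Hit]:
    """Sort hits by retrieval score, discounted when recognition confidence is low.

    weight = 0 ignores confidence; weight = 1 scales the score fully by it.
    """
    def combined(hit: Hit) -> float:
        if hit.word_confidences:
            confidence = sum(hit.word_confidences) / len(hit.word_confidences)
        else:
            confidence = 0.0
        return hit.retrieval_score * ((1 - weight) + weight * confidence)

    return sorted(hits, key=combined, reverse=True)


# Made-up example data: the first hit matches better but was recognised unreliably.
hits = [
    Hit("interview-03@12:41", 0.90, [0.35, 0.40]),
    Hit("interview-07@02:10", 0.80, [0.95, 0.90]),
]
ranked = rerank(hits)
```

With the default weight, the reliably recognised fragment moves ahead of the one with the higher raw retrieval score. Setting `weight=0` restores the original ranking, which makes the effect of the confidence discount easy to compare.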