IMPACT Final Event 26-06-2012 - Franciska de Jong - Indexing and searching of ‘noisy’ data


  1. Indexing and searching of noisy data
     Franciska de Jong
     University of Twente, cluster Human Media Interaction, Enschede, The Netherlands
     Erasmus University, Erasmus Studio for e-research, Rotterdam, The Netherlands
     http://hmi.ewi.utwente.nl/~fdejong
     IMPACT Closing Event - The Hague
  2. Overview
     Part I: Noisy data analysis – other examples
     Part II: Emerging scenarios of scholarly use
     Part III: From noisy (meta)data towards metadata mining
  3. Noisy Channel for Spelling Correction (J&M Figure 5.23)
     noise: limitations in spelling skills
  4. Noisy Channel for Speech Recognition (J&M Figure 9.2)
     noise: limitations in sound captured
  5. Noisy Channel for Machine Translation (J&M Figure 25.15)
     noise: loss of information through translation
  6. Noisy Channel for OCR (J&M Figure 5.23)
     noise: loss of information through typesetting/handwriting
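The four noisy-channel slides share one decoding rule: given an observed, corrupted signal o, choose the source w that maximises P(o | w) · P(w). The sketch below applies that rule to a spelling example. The function name, the candidate list, and all probabilities are made up for illustration and do not come from the slides.

```python
import math


def noisy_channel_decode(observed, candidates, channel_prob, prior_prob):
    """Return the candidate w that maximises P(observed | w) * P(w).

    Scores are summed in log space to avoid underflow with small probabilities.
    """
    def score(w):
        return math.log(channel_prob(observed, w)) + math.log(prior_prob(w))

    return max(candidates, key=score)


# Hypothetical toy probabilities (assumptions, not estimated from real data).
PRIOR = {"the": 0.05, "then": 0.005, "they": 0.004}          # P(w)
CHANNEL = {                                                  # P(o | w)
    ("teh", "the"): 0.01,
    ("teh", "then"): 0.0005,
    ("teh", "they"): 0.0002,
}

best = noisy_channel_decode(
    "teh",
    PRIOR.keys(),
    channel_prob=lambda o, w: CHANNEL[(o, w)],
    prior_prob=lambda w: PRIOR[w],
)
print(best)
```

For speech recognition, translation, or OCR, only the two probability models change: the channel model describes how the source is distorted, and the prior is a language model over possible sources.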
  7. Decoding spoken audio
     • Audio modelling: collect data on the ground truth for audio segments
     • Language modelling: collect data on co-occurrences of words
     • 100 hours of speech
     • Text data (500 M words)
     There is no data like more data
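Collecting data on co-occurrences of words can be pictured as counting adjacent word pairs (bigrams) and turning the counts into conditional probabilities. The following is a minimal sketch on a three-sentence toy corpus; the corpus and function names are assumptions, and a real language model would add smoothing and be trained on far more text, such as the 500 M words mentioned above.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs over whitespace-tokenised, lower-cased sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def bigram_prob(counts, first, second):
    """Unsmoothed maximum-likelihood estimate of P(second | first)."""
    total = sum(c for (w1, _), c in counts.items() if w1 == first)
    return counts[(first, second)] / total if total else 0.0


# Hypothetical toy corpus.
corpus = [
    "the archive holds interviews",
    "the archive holds recordings",
    "the interviews are long",
]
counts = bigram_counts(corpus)
print(bigram_prob(counts, "the", "archive"))
print(bigram_prob(counts, "archive", "holds"))
```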
  8. After decoding
     • multiple hypotheses with varying probabilities of being correct
     • selection from n-best list: errors unavoidable
     • post-editing can be an option, but never without extra costs
       – time (editors), money (editing platform)
       – complexity of workflow
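A small sketch of why selection from an n-best list cannot be error-free: even the most probable hypothesis leaves considerable probability mass on the alternatives. The hypotheses and probabilities below are invented for illustration.

```python
def pick_best(nbest):
    """Return the (hypothesis, probability) pair with the highest probability."""
    return max(nbest, key=lambda pair: pair[1])


# Hypothetical 3-best list with made-up probabilities that sum to 1.
nbest = [
    ("recognise speech", 0.46),
    ("wreck a nice beach", 0.41),
    ("recognise peach", 0.13),
]

best_text, best_prob = pick_best(nbest)
residual = 1.0 - best_prob  # probability that the selected hypothesis is wrong
print(best_text, round(residual, 2))
```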
  9. Impact of noise on access tasks
     • Content/metadata with a certain amount of errors
     • Search with reduced accuracy:
       – missed hits (false negatives)
       – incorrect hits (false positives; ‘noise’)
     • Noisy data less suited for presentation layer
       – pdf versus ascii
       – original audio versus transcript; alternatives: word clouds, related content
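Missed and incorrect hits are usually summarised as recall and precision. The sketch below computes both for one query; the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = correct hits / all retrieved items (lowered by false positives)
    recall    = correct hits / all relevant items  (lowered by false negatives)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: doc7 is a false positive, doc3 and doc4 are missed.
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc1", "doc2", "doc7"}

p, r = precision_recall(retrieved, relevant)
print(round(p, 2), round(r, 2))
```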
  10. Access to interviews: transcript generation
      [Diagram. Components: multimedia interview archive with metadata; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); search engine with query and result presentation; users: general public, archivists, researchers.]
  11. Optimization Strategies (1)
      • Error correction: post-editing, better recognition
      • Improved recognition
        – typically effective for core collections (WER below 20%)
        – less effective for the long tail
      Case: interviews with Willem Frederik Hermans
      • With models for news: 81% WER
      • Aim: reduction to around 60%
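Word error rate (WER) is the word-level edit distance between a reference transcript and the recogniser output, divided by the number of reference words. The sketch below implements that standard definition; the example sentences are invented and are not taken from the interview collection.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution (was -> is) and one deletion (in).
wer = word_error_rate(
    "the interview was recorded in amsterdam",
    "the interview is recorded amsterdam",
)
print(round(wer, 2))
```

Because insertions are counted as well, WER can exceed 100% when the hypothesis is much longer than the reference.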
  12. Optimization Strategies (2)
      • Dedicated/task-specific evaluation
        – for search applications, errors in function words are less critical than errors in, e.g., names of persons and locations
      • Dedicated weighting schemes for search tasks
        – assign confidence scores to fragments found and rerank search results accordingly
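One way to picture confidence-based reranking: scale each fragment's retrieval score by the recogniser's confidence in the matched words, then sort. The slide does not specify a weighting scheme, so the product used here, the `Hit` fields, and the numbers are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    fragment_id: str
    match_score: float  # hypothetical retrieval score from the search engine
    confidence: float   # hypothetical recognition confidence in [0, 1]


def rerank(hits):
    """Sort hits by match_score * confidence, highest first (assumed scheme)."""
    return sorted(hits, key=lambda h: h.match_score * h.confidence, reverse=True)


hits = [
    Hit("a", match_score=0.9, confidence=0.3),  # strong match, unreliable transcript
    Hit("b", match_score=0.7, confidence=0.8),
    Hit("c", match_score=0.8, confidence=0.5),
]
order = [h.fragment_id for h in rerank(hits)]
print(order)
```

Fragment "a" has the best raw match but drops to the bottom because the matched words are likely recognition errors.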
  13. Access to interviews: support for users
      [Diagram. Components: multimedia interview archive with metadata; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); search engine with query and result presentation; users: general public, archivists, researchers.]
  14. Part II: Emerging scenarios of scholarly use
  15. DLs and knowledge discovery
      • Focus of attention for analysis is no longer the document alone.
      • Room for statistical methods to analyse entire collections, archives, libraries.
      • Tools that automatically detect and capture various semantic layers and feed the patterns found back into the metadata structures.
      • Discovery versus item finding: room for serendipity and data-driven content exploration.
  16. Paradigm evolution (science examples / information studies examples)
      • Experimental work: direct observation / interpretation and decoding of texts
      • Theoretical modeling: E = mc², a² + b² = c² / S → NP VP, Principle of Compositionality
      • Computational modeling: change simulation / GIS for visualisation of mobility patterns
      • Data-intensive computing: particle physics, astronomy / text-mining: cross-document entity linking for cultural heritage libraries; rule-based parsing of large corpora (typology studies)
  17. More than search: metadata extraction
      • For large-scale digital (distributed) collections the potential added value of automatically generated metadata is becoming more and more apparent.
      • Automatic content labeling:
        – not just a matter of speeding up the annotation process and enlarging the scope of analysis, but also
        – starting point for generating annotation layers at collection level, and
        – basis for link structures for all kinds of semantic aspects of content, such as chronological trends, topic shifts, style and authenticity.
        – potentially noisy
  18. “Multi”-issues for DL metadata (1)
      • Multi-layer
        – beyond tombstone: content description at fragment level (full text, full content, etc.)
        – free text annotation versus thesaurus-based labeling
      • Multiple media formats
        – text, text, text
        – spoken audio, video, still images, music, scores, numerical data, sensor data, census data, etc.
  19. “Multi”-issues for DL metadata (2)
      • Multiple perspectives
        – cover more than local context
        – cover more than one domain perspective
        – cover more than one language
      • Multiple values due to uncertainty
        – multiple human annotators
        – automatic labeling extracted from potentially noisy data
        – dynamics in collection/context
  20. Scholarly use
      • Comparative perspective
        – Quantitative and qualitative issues
      • Need for enhanced content presentation:
        – Multiple layers
        – Links to context
        – Links to related content
      • Emerging methodological shift
        – Enhanced collection exploration (think of Google n-grams)
  21. Part III: From noisy data/metadata towards metadata mining
  22. Metadata mining: crucial steps
      • Treat all annotation types (classical metadata, automatically extracted metadata, scholarly annotation, community tagging) as assets.
      • Learn how to integrate the various types and layers to enhance accessibility and to be able to exploit the knowledge captured in metadata
        – Exploiting manual annotation for machine learning training
        – Detection of collection-level semantic features
        – Innovative interface and interaction design
  23. What can metadata mining bring?
      • Quality added to metadata for increased accessibility of content:
        – structured search (full text + classification-based)
        – navigation across collections, rich presentation layers
      • Increased insight into relations between data collections (across media types, languages, etc.)
      • Increased understanding of knowledge production as captured by metadata and annotation processing
      • Support for capturing the essence of association and analogy.
      There is no data like metadata!
  24. Issues for metadata models
      Old
      • annotation interoperability (e.g., metadata integration for content annotated with coding tools such as thesauri and ontologies)
      New
      • how to capture fuzziness and uncertainty coming from multiple sources and/or statistical processing
      • coding of change over time (e.g., metadata for the dynamics of temporal and geo-spatial details)
  25. Issues for scholarly users
      Individual level
      • Learn to deal with imperfection
      • Understand the limitations of technological innovation
      Community level
      • Stay tuned with developers
      • Organize methodology teaching
      • Study emerging practices
      • Share success stories
  26. Issues for developers
      • Learn about scholarly practices
      • Stay tuned with users during the entire process
      • Organize structured feedback loops
      • Study best practices
      • Share responsibility for centers of expertise
  27. Issues for e-humanities
      • e-humanities is e-research
      • multiple media, multiple platforms
      • keep connecting!
  28. Contact
      • email: f.m.g.dejong@utwente.nl or fdejong@ese.eur.nl
      • url: http://hmi.ewi.utwente.nl/~fdejong
