IMPACT Final Event 26-06-2012 - Franciska de Jong - Indexing and searching of ‘noisy’ data
Transcript

  • 1. Indexing and searching of noisy data
      Franciska de Jong
      University of Twente, cluster Human Media Interaction, Enschede, The Netherlands
      Erasmus University, Erasmus Studio for e-research, Rotterdam, The Netherlands
      http://hmi.ewi.utwente.nl/~fdejong
      IMPACT Closing Event - The Hague
  • 2. Overview
      • Part I: Noisy data analysis – other examples
      • Part II: Emerging scenarios of scholarly use
      • Part III: From noisy (meta)data towards metadata mining
  • 3. Noisy Channel for Spelling Correction (J&M Figure 5.23). Noise: limitations in spelling skills
  • 4. Noisy Channel for Speech Recognition (J&M Figure 9.2). Noise: limitations in sound captured
  • 5. Noisy Channel for Machine Translation (J&M Figure 25.15). Noise: loss of information through translation
  • 6. Noisy Channel for OCR (J&M Figure 5.23). Noise: loss of information through typesetting/handwriting
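The four noisy-channel slides share one decoding rule: given a noisy observation o, pick the source item w that maximises P(o | w) · P(w). The following is a minimal Python sketch of that rule for the spelling case. The function name `decode` and all probabilities are invented for illustration; they do not come from the slides or from any real model.

```python
import math

def decode(observation, candidates, channel_prob, prior_prob):
    """Return the candidate w maximising P(observation | w) * P(w).

    Log probabilities are summed to avoid numerical underflow.
    """
    def score(w):
        return math.log(channel_prob(observation, w)) + math.log(prior_prob(w))
    return max(candidates, key=score)

# Toy numbers, invented for illustration only.
PRIOR = {"the": 0.05, "tie": 0.001, "toe": 0.0005}          # language model P(w)
CHANNEL = {("tne", "the"): 0.02,                            # channel model P(o | w)
           ("tne", "tie"): 0.01,
           ("tne", "toe"): 0.001}

best = decode("tne", PRIOR.keys(),
              lambda o, w: CHANNEL[(o, w)],
              lambda w: PRIOR[w])
print(best)
```

For speech recognition, machine translation and OCR the same rule applies; only the channel model (acoustic, translation or character-confusion model) changes.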
  • 7. Decoding spoken audio
      • Audio modelling: collect data on the ground truth for audio segments
      • Language modelling: collect data on co-occurrences of words
      • 100 hours of speech
      • Text data (500 M words)
      There is no data like more data
  • 8. After decoding
      • multiple hypotheses with varying probabilities of being correct
      • selection from n-best list: errors unavoidable
      • post-editing can be an option, but never without extra costs
          – time (editors), money (editing platform)
          – complexity of workflow
  • 9. Impact of noise on access tasks
      • Content/metadata with a certain amount of errors
      • Search with reduced accuracy:
          – missed hits (false negatives)
          – incorrect hits (false positives; ‘noise’)
      • Noisy data less suited for presentation layer
          – pdf versus ascii
          – original audio versus transcript; alternatives: word clouds, related content
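Missed and incorrect hits are usually summarised as recall and precision. The sketch below shows one way to compute both from a result list and a set of known relevant items. The function name and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compare a search result with the set of truly relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = retrieved & relevant
    false_pos = retrieved - relevant   # incorrect hits ('noise')
    false_neg = relevant - retrieved   # missed hits
    precision = len(true_pos) / len(retrieved) if retrieved else 0.0
    recall = len(true_pos) / len(relevant) if relevant else 0.0
    return precision, recall, false_pos, false_neg

# Hypothetical example: three documents returned, four actually relevant.
p, r, fp, fn = precision_recall(["d1", "d2", "d5"], ["d1", "d2", "d3", "d4"])
print(f"precision={p:.2f} recall={r:.2f}")
print("incorrect hits:", sorted(fp), "missed hits:", sorted(fn))
```

Recognition errors in the indexed text tend to lower both figures: a misrecognised query term is missed where it occurs and matched where it does not.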
  • 10. Access to interviews: transcript generation
      [Diagram. Components: multimedia interview archive; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); metadata; search engine; query; result presentation; users: general public, archivists, researchers]
  • 11. Optimization Strategies (1)
      • Error correction: post-editing, better recognition
      • Improved recognition
          – typically effective for core collections (word error rate, WER, below 20%)
          – less effective for the long tail
      Case: interviews with Willem Frederik Hermans
      • With models for news: 81% WER
      • Aim: reduction to around 60%
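WER is conventionally the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of words in the reference. A minimal sketch using word-level edit distance follows. The function name and the example sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.0%}")
```

Because insertions count as errors, WER can exceed 100%. A figure of 81% means the error count equals roughly four fifths of the reference word count.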
  • 12. Optimization Strategies (2)
      • Dedicated/task-specific evaluation
          – for search applications, errors in function words are less critical than errors in e.g. names of persons and locations
      • Dedicated weighting schemes for search tasks
          – assign confidence scores to fragments found and rerank search results accordingly
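The reranking idea can be sketched as follows, assuming each hit carries a search-engine relevance score and a recognizer confidence between 0 and 1. The `Hit` class, the `weight` parameter and the combination formula are assumptions made for illustration; the slides do not specify a scheme.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    fragment_id: str
    relevance: float    # score from the search engine (hypothetical scale)
    confidence: float   # recognizer confidence for the matched words, 0..1

def rerank(hits, weight=0.5):
    """Order hits by relevance, discounted for low recognition confidence.

    weight=0 ignores confidence; weight=1 scales relevance fully by it.
    """
    def combined(hit):
        return hit.relevance * ((1 - weight) + weight * hit.confidence)
    return sorted(hits, key=combined, reverse=True)

hits = [
    Hit("a", relevance=2.0, confidence=0.2),   # strong match, doubtful transcript
    Hit("b", relevance=1.8, confidence=0.9),
    Hit("c", relevance=1.0, confidence=1.0),
]
ranked = [h.fragment_id for h in rerank(hits)]
print(ranked)
```

With the default weight, fragment "b" overtakes "a" because its match rests on more reliably recognised words.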
  • 13. Access to interviews: support for users
      [Diagram. Components: multimedia interview archive; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); metadata; search engine; query; result presentation; users: general public, archivists, researchers]
  • 14. Part II: Emerging scenarios of scholarly use
  • 15. DLs (digital libraries) and knowledge discovery
      • Focus of attention for analysis is no longer the document alone.
      • Room for statistical methods to analyse entire collections, archives, libraries.
      • Tools that automatically detect and capture various semantic layers and feed the patterns found back into the metadata structures.
      • Discovery versus item finding: room for serendipity and data-driven content exploration.
  • 16. Paradigm evolution
      • Experimental work
          – Science examples: direct observation
          – Information studies examples: interpretation/decoding of texts
      • Theoretical modeling
          – Science examples: E = mc², a² + b² = c²
          – Information studies examples: S → NP VP, Principle of Compositionality
      • Computational modeling
          – Science examples: change simulation
          – Information studies examples: GIS for visualisation of mobility patterns
      • Data-intensive computing
          – Science examples: particle physics, astronomy
          – Information studies examples: text-mining: cross-document entity linking for cultural heritage libraries; rule-based parsing of large corpora (typology studies)
  • 17. More than search: metadata extraction
      • For large-scale digital (distributed) collections the potential added value of automatically generated metadata is becoming more and more apparent.
      • Automatic content labeling:
          – not just a matter of speeding up the annotation process and enlarging the scope of analysis, also
          – starting point for generating annotation layers at collection level, and
          – basis for link structures for all kinds of semantic aspects of content, such as chronological trends, topic shifts, style and authenticity.
          – potentially noisy
  • 18. “Multi”-issues for DL metadata (1)
      • Multi-layer
          – beyond tombstone: content description at fragment level (full text, full content, etc.)
          – free text annotation versus thesaurus-based labeling
      • Multiple media formats
          – text, text, text
          – spoken audio, video, still images, music, scores, numerical data, sensor data, census data, etc.
  • 19. “Multi”-issues for DL metadata (2)
      • Multiple perspectives
          – cover more than local context
          – cover more than one domain perspective
          – cover more than one language
      • Multiple values due to uncertainty
          – multiple human annotators
          – automatic labeling extracted from potentially noisy data
          – dynamics in collection/context
  • 20. Scholarly use
      • Comparative perspective
          – Quantitative and qualitative issues
      • Need for enhanced content presentation:
          – Multiple layers
          – Links to context
          – Links to related content
      • Emerging methodological shift
          – Enhanced collection exploration (think of Google n-grams)
  • 21. Part III: From noisy data/metadata towards metadata mining
  • 22. Metadata mining: crucial steps
      • Treat all annotation types (classical metadata, automatically extracted metadata, scholarly annotation, community tagging) as assets.
      • Learn how to integrate the various types and layers to enhance accessibility and to be able to exploit the knowledge captured in metadata
          – Exploiting manual annotation for machine learning training
          – Detection of collection-level semantic features
          – Innovative interface and interaction design
  • 23. What can metadata mining bring?
      • Quality added to metadata for increased accessibility of content:
          – structured search (full text + classification-based)
          – navigation across collections, rich presentation layers
      • Increased insight in relations between data collections (across media types, languages, etc.)
      • Increased understanding of knowledge production as captured by metadata and annotation processing
      • Support for capturing the essence of association and analogy.
      There is no data like metadata!
  • 24. Issues for metadata models
      Old
      • annotation interoperability (e.g., metadata integration for content annotated with coding tools such as thesauri and ontologies)
      New
      • how to capture fuzziness and uncertainty coming from multiple sources and/or statistical processing
      • coding of change over time (e.g., metadata for the dynamics of temporal and geo-spatial details)
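One way to picture the two "new" requirements is a metadata field that keeps every competing value together with its source, a confidence score and an optional validity period, instead of a single definitive value. The sketch below is purely illustrative: the class names, field names and sample values are hypothetical and do not correspond to any existing metadata standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Assertion:
    value: str
    source: str                       # e.g. "archivist", "speech recognition"
    confidence: float                 # 0..1, however the source estimates it
    valid_from: Optional[int] = None  # first year the value applies, if known
    valid_to: Optional[int] = None    # last year the value applies, if known

@dataclass
class MetadataField:
    name: str
    assertions: List[Assertion] = field(default_factory=list)

    def add(self, assertion: Assertion) -> None:
        self.assertions.append(assertion)

    def best(self, year: Optional[int] = None) -> Optional[Assertion]:
        """Most confident assertion, optionally restricted to one year."""
        def applies(a: Assertion) -> bool:
            if year is None:
                return True
            return ((a.valid_from is None or a.valid_from <= year)
                    and (a.valid_to is None or year <= a.valid_to))
        candidates = [a for a in self.assertions if applies(a)]
        return max(candidates, key=lambda a: a.confidence, default=None)

# Hypothetical values: a manual label and a noisy automatic one.
speaker = MetadataField("speaker")
speaker.add(Assertion("W.F. Hermans", source="archivist", confidence=1.0))
speaker.add(Assertion("W.F. Herman", source="speech recognition", confidence=0.6))

# Hypothetical place labels whose validity changes over time.
location = MetadataField("location")
location.add(Assertion("Place A", "catalogue", 0.9, valid_to=1950))
location.add(Assertion("Place B", "text mining", 0.7, valid_from=1951))

print(speaker.best().value, location.best(year=1960).value)
```

Keeping all assertions, rather than overwriting them, leaves room for later reweighting when better recognizers or additional annotators become available.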
  • 25. Issues for scholarly users
      Individual level
      • Learn to deal with imperfection
      • Understand the limitations of technological innovation
      Community level
      • Stay tuned with developers
      • Organize methodology teaching
      • Study emerging practices
      • Share success stories
  • 26. Issues for developers
      • Learn about scholarly practices
      • Stay tuned with users during the entire process
      • Organize structured feedback loops
      • Study best practices
      • Share responsibility for centers of expertise
  • 27. Issues for e-humanities
      • e-humanities is e-research
      • multiple media, multiple platforms
      • keep connecting!
  • 28. Contact
      • email: f.m.g.dejong@utwente.nl or fdejong@ese.eur.nl
      • url: http://hmi.ewi.utwente.nl/~fdejong
